Label Documents

For the number of documents needed, see Pipelines.

For more details about how to assemble a high-quality dataset, see Training High Performing Models.

Fields That Occur Multiple Times on the Same Document

There are many situations where a field appears in multiple places in the same document or even on the same page. These should all be labelled, as long as they have the same meaning.

For instance, the total amount on a utility bill often appears at the top, within the line item list in the middle, and on a payment slip at the bottom, which can be detached and mailed with the check. In this situation, all three occurrences should be labelled. This is useful because if an OCR error or a different layout prevents one occurrence from being identified, the model can still identify the others.

Note: What counts is the meaning of the value, not the value itself. For instance, on some invoices that carry no tax, the net amount and the total amount have the same value, but they are clearly different concepts. Consequently, do not label both as total amount. Label only the value whose meaning is the total amount.
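The rule above can be pictured with a small sketch. The `Label` record, the field names `total-amount` and `net-amount`, and the values are hypothetical and chosen only for illustration. They are not the format Document Manager stores or exports.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    field: str  # field name from the schema (hypothetical names below)
    page: int   # 1-based page number
    text: str   # text of the labelled words


def occurrences_by_field(labels):
    """Group labelled text by field name, keeping every occurrence."""
    grouped = defaultdict(list)
    for label in labels:
        grouped[label.field].append(label.text)
    return dict(grouped)


# A utility bill with no tax: net and total carry the same value,
# but each occurrence is labelled according to what it means.
labels = [
    Label("total-amount", 1, "120.00"),  # summary at the top
    Label("total-amount", 1, "120.00"),  # within the line item list
    Label("total-amount", 1, "120.00"),  # detachable payment slip
    Label("net-amount", 1, "120.00"),    # same value, different meaning
]

grouped = occurrences_by_field(labels)
print(grouped)
```

All three occurrences of the total amount end up under one field, while the net amount stays separate even though its value is identical.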

Multiple Users Labeling in Parallel

Multiple users can label in the same instance at the same time, even on the same document.

If users change the schema concurrently, the change goes through for one user, and the other users see a warning message stating that their changes could not be performed. Those users should refresh their browser immediately to see the changes.

Labeling for Training

When you import a dataset without checking the Make this an Evaluation set checkbox on the Import Data dialog box, that dataset is used for training, and you only need to focus on labeling the words (grey boxes) on the document.

If the text filled into the sidebar fields is occasionally incorrect, this is not a problem, because the ML model still learns. In some cases, you may need to adjust the configuration of the fields, for instance by checking the Multi-line checkbox. In general, however, the main focus is on labeling the words on the page.

Labeling for Evaluation

When you import a dataset and you check the Make this an Evaluation set checkbox on the Import Data dialog, then that dataset is ignored by Training Pipelines in AI Center and used only by Evaluation Pipelines.

It is important that the correct text is filled into the fields in the sidebar (or the top bar for Column fields). This takes much longer to verify for each field, but it is the only way you get a reliable metric of the accuracy of the ML model you are building.
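To see why the sidebar text matters for evaluation, consider a minimal sketch that scores predictions by exact text match. This is an assumption made for illustration only: the metric actually computed by Evaluation Pipelines is not described in this section, and the function name, field names, and values below are hypothetical.

```python
def exact_match_accuracy(ground_truth, predictions):
    """Share of fields whose predicted text equals the labelled text exactly."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    hits = sum(
        1
        for field, expected in ground_truth.items()
        if predictions.get(field) == expected
    )
    return hits / len(ground_truth)


# What the model extracted (hypothetical values).
predictions = {"invoice-no": "INV-1042", "total-amount": "120.00"}

# Sidebar text that the labeller verified.
verified = {"invoice-no": "INV-1042", "total-amount": "120.00"}

# Same labels, but an OCR slip (letter O instead of zero) was left uncorrected.
unverified = {"invoice-no": "INV-1O42", "total-amount": "120.00"}

print(exact_match_accuracy(verified, predictions))    # 1.0
print(exact_match_accuracy(unverified, predictions))  # 0.5
```

In this sketch the model output is identical in both cases. Only the unverified ground truth makes a correct prediction look wrong, which is why each field value in an evaluation set needs to be checked.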

Starting with the 2021.10 release, Document Manager supports labeling multi-page documents. Consequently, fields in the sidebar have a single value for the entire document. This closely reflects the behavior at run time in the RPA workflow and enables Evaluation Pipelines in AI Center to produce realistic scores reflecting the real run time performance of the ML models.

However, keep in mind that this is a major change from previous releases where each page was labelled separately. Labeling and exporting multi-page documents assumes each document represents a single logical document. For instance, a six-page document may contain a single six-page invoice but it should not contain three different invoices, two pages each. This is particularly important for evaluation sets.
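If a scanned file bundles several logical documents, it needs to be split before import. The following is a minimal sketch of the page arithmetic only, assuming you already know how many pages each logical document has. The helper `page_ranges` is hypothetical and does not read or write any files.

```python
def page_ranges(pages_per_document):
    """Return 1-based inclusive (first, last) page ranges, one per logical document."""
    ranges = []
    first = 1
    for count in pages_per_document:
        if count < 1:
            raise ValueError("each logical document needs at least one page")
        ranges.append((first, first + count - 1))
        first += count
    return ranges


# A six-page scan holding three two-page invoices should become three files.
print(page_ranges([2, 2, 2]))  # [(1, 2), (3, 4), (5, 6)]

# A single six-page invoice stays as one document.
print(page_ranges([6]))  # [(1, 6)]
```

Each resulting range can then be exported as its own file with the PDF tool of your choice, so that every imported document has one value per field.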

Labeling Actions

The main actions you perform when labeling documents are described below. A given field may be labelled in multiple places on the same page.

Label a Field

Select an individual text box by clicking it.

To select multiple words, click the first word and then Ctrl/Shift+click the remaining words, or select an entire area by dragging the mouse over it (rubber banding).

To remove text boxes from your selection, hold Ctrl/Shift and click or rubber band the unwanted text boxes again.

When your selection is accurate, press the field's shortcut key to label the field.

Label a Multivalued Field

Make sure that the multivalued option of the field is selected.

Select the words that make up the first value and press the shortcut key to label the field.

Repeat the previous step for each remaining value until all the values are labelled for the multivalued field.

Note:

Multivalued fields can be used only with Machine Learning Packages version 2022.10 or higher.
A multivalued field displays two values in its collapsed state and all values in its expanded state. Click the expand arrow on the multivalued field to view the list of all tagged values.

Remove a Label

Select the labelled text boxes, then press the Delete or Backspace key on your keyboard.

Group a Table Row

Grouping is needed only when some table rows span multiple lines of text. After you have labelled the Column fields, select the labelled text boxes that belong to the same table row and press the / key to group them. A green box appears around the group.

When a labelled column field is grouped together, the table is parsed and displayed at the top, highlighting the extracted data.
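The effect of grouping can be sketched as follows. This is an illustrative model only, assuming that each line of text would otherwise be read as its own row. The function `build_rows`, the column names, and the cell values are hypothetical and do not reflect how Document Manager parses tables internally.

```python
def build_rows(cells, groups=()):
    """Assemble table rows from labelled column cells.

    cells:  list of (line_number, column_name, text) tuples
    groups: iterable of sets of line numbers grouped as one row
    """
    # Map each grouped line to the first line of its group.
    row_key = {}
    for group in groups:
        anchor = min(group)
        for line in group:
            row_key[line] = anchor

    rows = {}
    for line, column, text in sorted(cells):
        key = row_key.get(line, line)  # ungrouped lines form their own row
        row = rows.setdefault(key, {})
        # Text from several lines in the same column is joined with a space.
        row[column] = f"{row[column]} {text}" if column in row else text
    return [rows[key] for key in sorted(rows)]


cells = [
    (1, "description", "Electricity supply,"),
    (2, "description", "March 2023"),
    (1, "line-amount", "80.00"),
    (3, "description", "Network fee"),
    (3, "line-amount", "40.00"),
]

ungrouped = build_rows(cells)
grouped = build_rows(cells, groups=[{1, 2}])
print(len(ungrouped), len(grouped))  # 3 2
print(grouped[0])
```

Without grouping, the second line of the description is split off into a row of its own. Grouping lines 1 and 2 yields one row whose description contains both lines.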

Ungroup a Table Row

Select the group and press the / key again.

Make Corrections to the Labelled Value

Click on the text in the sidebar or the top bar and edit the content. A small lock appears to indicate the field has been manually edited. This is necessary when labeling evaluation sets.