Label Documents
For the volumes of documents needed, see Pipelines.
For more details about how to assemble a high-quality dataset, see Training High Performing Models.
There are many situations where a field appears in multiple places in the same document or even on the same page. These should all be labeled, as long as they have the same meaning.
For instance, the total amount on a utility bill often appears at the top, within the line item list in the middle, and on the payment slip at the bottom, which can be detached and mailed with the check. In this situation, all three occurrences should be labeled. This is useful because if an OCR error or a different layout prevents one occurrence from being identified, the model can still identify the others.
You can have multiple users use the same instance to label at the same time, even on the same document.
If multiple users make concurrent changes to the schema, only the first change goes through; the other user(s) see a warning message stating that their changes could not be performed. They should immediately refresh their browser to pick up the latest schema.
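Conceptually, this is a first-write-wins resolution. The sketch below illustrates the general pattern with an optimistic version check; it is an assumption for illustration only, not Data Manager's actual implementation.

```python
# A minimal sketch of first-write-wins conflict resolution using an
# optimistic version check. Illustrative only; not the product's code.

class SchemaStore:
    def __init__(self):
        self.version = 0
        self.fields = []

    def save(self, fields, based_on_version):
        # Reject the update if another user changed the schema first.
        if based_on_version != self.version:
            raise RuntimeError("Schema changed by another user; refresh and retry.")
        self.fields = fields
        self.version += 1

store = SchemaStore()
store.save(["total"], based_on_version=0)       # first user: succeeds
try:
    store.save(["vendor"], based_on_version=0)  # second user: stale version
except RuntimeError as e:
    print(e)
```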
When you import a dataset without checking the Make this an Evaluation set checkbox on the Import Data dialog box, the dataset is used for training, and you only need to focus on labeling the words (the grey boxes) on the document.
If the text that gets filled into the sidebar fields is occasionally incorrect, this is not a problem, as the ML model still learns. In some cases, you may need to adjust the configuration of a field, for instance by checking the Multi-line checkbox. In general, though, the main focus is on labeling the words on the page.
When you import a dataset with the Make this an Evaluation set checkbox checked on the Import Data dialog box, the dataset is ignored by Training Pipelines in AI Center and used only by Evaluation Pipelines.
It is important that the correct text is filled into the fields in the sidebar (or the top bar for Column fields). Verifying this for each field takes much longer, but it is the only way to get a reliable metric of the accuracy of the ML model you are building.
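To see why the sidebar text matters, consider how an evaluation score is computed: a prediction only counts as correct if it matches the labeled text. The sketch below is illustrative only; the data shapes, normalization, and function name are assumptions, not the actual Evaluation Pipeline implementation.

```python
# Illustrative only: exact-match accuracy for one field across documents.
# A correct prediction compared against a mislabeled ground truth still
# counts as an error, which is why evaluation labels must be verified.
def field_accuracy(predictions: dict, ground_truth: dict) -> float:
    matches = sum(
        1
        for doc_id, expected in ground_truth.items()
        if predictions.get(doc_id, "").strip() == expected.strip()
    )
    return matches / len(ground_truth) if ground_truth else 0.0

predictions  = {"doc-1": "1,250.00", "doc-2": "310.45"}
ground_truth = {"doc-1": "1,250.00", "doc-2": "310.46"}  # mislabeled value
print(field_accuracy(predictions, ground_truth))  # 0.5, despite a correct model
```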
Starting with the 2021.10 release, Data Manager supports labeling multi-page documents. Consequently, fields in the sidebar have a single value for the entire document. This closely reflects the behavior at run time in the RPA workflow and enables Evaluation Pipelines in AI Center to produce realistic scores reflecting the real run time performance of the ML models.
However, keep in mind that this is a major change from previous releases where each page was labeled separately. Labeling and exporting multi-page documents assumes each document represents a single logical document. For instance, a six-page document may contain a single six-page invoice but it should not contain three different invoices, two pages each. This is particularly important for evaluation sets.
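If a file does bundle several logical documents, split it before import. Below is a minimal Python sketch using the pypdf library; the input file name and the fixed two-pages-per-invoice assumption are illustrative only.

```python
# Hypothetical pre-processing step: split a PDF that bundles three
# two-page invoices into one file per invoice before import.
from pypdf import PdfReader, PdfWriter

reader = PdfReader("three_invoices.pdf")   # hypothetical input file
pages_per_invoice = 2                      # illustrative page count

for start in range(0, len(reader.pages), pages_per_invoice):
    writer = PdfWriter()
    for i in range(start, min(start + pages_per_invoice, len(reader.pages))):
        writer.add_page(reader.pages[i])   # copy one page into the new file
    with open(f"invoice_{start // pages_per_invoice + 1}.pdf", "wb") as out:
        writer.write(out)
```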
The main actions you need to perform when labeling documents are listed below. Remember that a given field may be labeled in multiple places on the same page.
- Select an individual text box by clicking it. Ctrl/Shift+click the rest of the desired words, or select an entire area by dragging the mouse over it (rubber banding).
- To deselect unwanted text boxes while Ctrl/Shift is pressed, click or rubber-band them again.
- When your selection is accurate, tap the shortcut key to label the field.
- After labeling the values of a Column field on the same row, press the / key to indicate that they are part of the same table row. A green box appears around the group.
- To correct the extracted text, click on the text in the sidebar or the top bar and edit the content. A small lock appears to indicate that the field has been manually edited. This is necessary when labeling evaluation sets.