Form Extractor
The Form Extractor is an extraction approach best suited to use cases in which fixed-format documents need to be processed and data extracted from them. In other words, if your documents show little to no variation in layout, the Form Extractor is a good choice.
The Form Extractor relies on templates defined up front, at the design stage. At run-time, it uses a complex set of rules to match the configured templates against the incoming documents, thus identifying and reporting the expected information.
The activity comes with a configuration wizard that assists you in defining the templates for the document types and fields you want to target for data extraction in this way.
The activity supports both simple field and table field extraction.
Consider other extraction methods if:
- there are many layouts that need to be handled
- documents are not only skewed, rotated, or of different sizes, but also show "warping" (curving in certain areas).
Note:
For fixed form extraction, to evaluate if layouts of two files are the same, try overlapping them in a tool, with some transparency, to see if all non-variable content overlaps (after de-rotation, de-skewing and bringing the two images to the same scale).
If you notice variability (non-variable content appears more to the left / right / top / bottom for certain areas of the document), then the layouts are not considered the same.
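The manual overlay check described in this note can be approximated in code once word positions are available from OCR. The following is a minimal sketch, assuming each layout is given as a mapping from a fixed label to its page-normalized (x, y) center, already de-rotated, de-skewed and scaled. The function name, the input format and the 0.01 tolerance are illustrative assumptions and are not part of the Form Extractor.

```python
from math import hypot

def same_layout(labels_a, labels_b, tolerance=0.01):
    """Return True if every shared fixed label sits at almost the same place.

    labels_a / labels_b: dict mapping label text -> (x, y), with coordinates
    normalized to the 0..1 range of the page (hypothetical input format).
    tolerance: maximum allowed displacement, as a fraction of the page size.
    """
    shared = labels_a.keys() & labels_b.keys()
    if not shared:
        return False  # nothing to compare, so no match can be claimed
    return all(
        hypot(labels_a[k][0] - labels_b[k][0],
              labels_a[k][1] - labels_b[k][1]) <= tolerance
        for k in shared
    )
```

If this check fails for any fixed label, the two files would be treated as different layouts, which mirrors the "content appears more to the left / right / top / bottom" criterion above.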
The Form Extractor allows you to define multiple templates for the same document type, and, at run-time, it:
- identifies the best matching template for the incoming document and document type
- applies the template matching algorithm, based on page-level anchors, to each page that data needs to be extracted from (missing or repeating pages are not supported)
- reports the identified information from the target value areas.
It also supports fine-tuning of checkbox / boolean field processing, by allowing the configuration of "synonyms" for a "Yes" or "No" value, according to your use case.
This extractor does not have learning (training) capabilities and requires up-front configuration.
You need to use your Automation Cloud Document Understanding API Key, or host your own instance of the Form Extractor in AI Center on-premises, to use this extractor.
The Form Extractor has two major configurations to be considered:
- the Template Manager wizard - which allows you to define templates to be applied to incoming documents. This wizard also makes available the Template Editor wizard, along with the Boolean field interpretation settings.
- the MinOverlapPercentage setting - which allows you to control how strict the value area matching should be. It accepts a value between 0 and 100, and it controls what words are accepted or rejected from being part of a given value, based on how well their location fits the area defined in the template.
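To make the MinOverlapPercentage setting more concrete, here is a minimal sketch of one plausible interpretation: the share of a word's bounding box that falls inside the template's value area is compared against the threshold. The exact formula used by the Form Extractor is not described on this page, so the box format, the function names and the computation below are assumptions for illustration only.

```python
def overlap_percentage(word_box, area_box):
    """Percentage (0-100) of the word box that lies inside the area box.

    Boxes are (left, top, right, bottom) tuples in the same coordinate system.
    """
    wl, wt, wr, wb = word_box
    al, at, ar, ab = area_box
    # Width and height of the intersection rectangle (0 if the boxes are disjoint).
    inter_w = max(0.0, min(wr, ar) - max(wl, al))
    inter_h = max(0.0, min(wb, ab) - max(wt, at))
    word_area = (wr - wl) * (wb - wt)
    if word_area <= 0:
        return 0.0
    return 100.0 * inter_w * inter_h / word_area

def accept_word(word_box, area_box, min_overlap_percentage):
    """Keep the word only if enough of it falls inside the value area."""
    return overlap_percentage(word_box, area_box) >= min_overlap_percentage
```

Under this reading, a higher value makes matching stricter (words that only partly touch the area are rejected), while a lower value is more tolerant of small shifts between the template and the incoming document.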
The Template Manager wizard allows you to create, edit, manage, and export/import templates for the document types defined in the taxonomy.
Creating a template
- Add a Form Extractor activity to your workflow, within a Data Extraction Scope.
- Configure the extractor by clicking the Manage Templates button.
- The Template Manager window opens.
- Click the Create Template button to create a new template.
- Select the document type you are defining the template for from the Document Type drop-down list.
Note: All Document Types are based on the Taxonomy. Make sure to add or create a Taxonomy inside the project's folder.
- Add the name of the template in the Template name field. Make sure it is a relevant name that represents what version of the document, or what layout, you are capturing and configuring through it.
- Add the document path in the Template document field.
- Navigate to the file's path by using the Browse button.
- Select an OCR engine from the OCR Engine drop-down list, and configure it as required.
- Click the Configure button to trigger template editing.
The OCR engine is applied only if necessary. If the document selected for building a template is a Native PDF, then no OCR engine is executed.
Each OCR engine comes with its own set of custom options. Here you can find more details about all options available for each OCR engine.
If you already created a template, then it can be edited, exported, or removed.
The Delete and Export buttons become available only when at least one template is selected. The Edit and Remove options for an individual template are always available.
Configuring Boolean Field Processing
Use the Boolean field interpretation settings to configure the "synonyms" that map to a Yes or No reported value.
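The idea of Yes / No synonyms can be sketched as a simple lookup. This is only an illustration of the concept: the function name, the case-insensitive comparison and the sample synonym lists are assumptions, not the activity's actual behavior.

```python
def interpret_boolean(raw_text, yes_synonyms, no_synonyms):
    """Map extracted text to "Yes", "No", or None if no synonym matches."""
    normalized = raw_text.strip().casefold()
    if normalized in {s.casefold() for s in yes_synonyms}:
        return "Yes"
    if normalized in {s.casefold() for s in no_synonyms}:
        return "No"
    return None  # unknown marking; left for a human or a later step to resolve

# Hypothetical synonym lists for a form that marks checkboxes with "X".
YES = ["x", "checked", "yes"]
NO = ["", "unchecked", "no"]
```

For example, `interpret_boolean(" X ", YES, NO)` returns `"Yes"`, while a value that appears in neither list returns `None`.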
Exporting and Importing Templates
You can import templates created and exported from other workflows. Use these features to share templates between projects, so that once a document type is configured using the Form Extractor, you won't need to re-configure the templates in a new implementation.
Exporting Procedure
Here are the steps you need to follow to export a template:
- Create one or more templates by following the steps explained at the beginning of this page.
- Select the templates you want to export.
- Select an Export option (with or without the original files) as shown in the below screenshot. Exporting with original files attaches them to the export. The second option doesn't attach the files used for template creation.
- Save the template's archive with the desired name.
- A message is displayed once the template is saved. Select the OK button.
Note: If you cannot share the content of the documents you have built your templates on, then use the "Without Original Files" option. You will still be able to share and import the template archive in other projects, but you will not be able to edit or view the templates anymore.
If you want to be able to edit the templates once imported in a different project, make sure to use the "With Original Files" option when exporting and then importing them.
Importing Procedure
Here are the steps you need to follow to import a template:
- Select the Import button.
- Select an archive. The import wizard appears and presents all document types and all templates available in the selected export archive.
- Select the templates you wish to import and choose the right Import option (with or without the original files).
Note:
- When templates are imported, document types are created automatically in the project's Taxonomy. If a document type with the same name already exists, another one is created by appending a count to the document type name.
- If you are importing templates that have been exported without the original files, or if you choose to import templates without the original files, then you have no view or edit options for those templates.
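The name-collision rule in the note above can be sketched as follows. The page only states that a count is appended to the document type name; the exact suffix format (here the bare number, starting at 1) and the function name are assumptions.

```python
def unique_document_type_name(name, existing_names):
    """Return `name`, or `name` plus a count if it already exists.

    The suffix format is hypothetical; only the "append a count" behavior
    comes from the documentation.
    """
    if name not in existing_names:
        return name
    count = 1
    while f"{name}{count}" in existing_names:
        count += 1
    return f"{name}{count}"
```

The practical consequence is that importing the same archive twice can leave near-duplicate document types in the Taxonomy, so it is worth reviewing the Taxonomy after an import.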
Special Situations when Importing a Template
When a template is imported, several special situations might occur. The below table explains each situation and its particularities:
| Import Type | Activity Behavior |
|---|---|
| New document type | If a new document type is imported, then a new field is added in the wizard configurator, informing you that a new template is to be created. |
| Duplicate document type | If an identical document type is imported, then the following warning message appears: |
| Extended template | If a document type template that includes extra fields than the already existing one is imported, then the following warning message appears: |
| Extended document type | If the user imports a document type that includes extra fields than the already existing one, then the following warning message appears: |
| Document type with identical name but different content | If the user imports a document type that has the same name as the existing one but different fields, then the following warning message appears: |
| Document type with missing table | If the user imports a document type that doesn't include a table, then the following warning message appears: |
| Document type with extended table | If the user imports a document type that includes a table with extra columns, then the following warning message appears: |
| Document type with reduced table | If the user imports a document type that includes a table with missing columns, then the following warning message appears: |
| Table template with different document types | If the user imports a document type template that includes a table with different document types, then a new template is created. If your taxonomy includes a table that has a field with a different document type, then the following message appears: |
General Considerations
The Template Editor is built on top of the functionality present in the Validation Station.
To learn about the basic usage of the Validation Station, see the Validation Station documentation.
Configuring Page Level Anchors
When defining or editing a fixed form template, the first step is to make the Page 1 Matching Info selection.
This field appears first on the left side of the screen. Configure it with words (only tokens are accepted) from the first page of the template that meet two conditions: they are always in the same position within that particular template layout, and they form a unique graph of words (considering relative distances and angles between words) across all the templates defined for a particular document type. In other words, the Page 1 Matching Info field and all other Page Matching Info fields are "fingerprints" of a particular page and are used extensively to identify the right matching template at run-time.
For this reason, for the Page 1 Matching Info field, it is strongly recommended to select 10 to 20 words, preferably longer, spread across the entire page area, that would form a unique pattern across all defined templates for that document type.
The other Page Matching Info fields (one for each template page) must be filled in only if you are attempting data extraction from that particular page, and do not require cross-template uniqueness anymore. If no fields need to be extracted from a particular page, defining the page level matching info for that page is not mandatory.
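To illustrate what a "graph of words" based on relative distances and angles might look like, here is a minimal sketch that builds a pairwise fingerprint from anchor word positions and compares two fingerprints within a tolerance. The data format, function names and tolerance values are assumptions. The actual matching algorithm of the Form Extractor is not described on this page and is likely more involved.

```python
from itertools import combinations
from math import atan2, degrees, hypot

def page_fingerprint(anchors):
    """anchors: dict word -> (x, y) center of the word on the page.

    Returns a dict mapping each word pair to (distance, angle in degrees).
    """
    fingerprint = {}
    for (w1, p1), (w2, p2) in combinations(sorted(anchors.items()), 2):
        dx, dy = p2[0] - p1[0], p2[1] - p1[1]
        fingerprint[(w1, w2)] = (hypot(dx, dy), degrees(atan2(dy, dx)))
    return fingerprint

def fingerprints_match(fp_a, fp_b, dist_tol=5.0, angle_tol=2.0):
    """True if both fingerprints cover the same word pairs with similar geometry."""
    if fp_a.keys() != fp_b.keys():
        return False
    for key in fp_a:
        dist_diff = abs(fp_a[key][0] - fp_b[key][0])
        # Compare angles on a circle so that 179 and -179 degrees count as close.
        angle_diff = abs((fp_a[key][1] - fp_b[key][1] + 180.0) % 360.0 - 180.0)
        if dist_diff > dist_tol or angle_diff > angle_tol:
            return False
    return True
```

Because the fingerprint uses only relative geometry, it is unaffected by a uniform shift of the whole page. It also shows why 10 to 20 well-spread words help: more pairs make it less likely that two different templates produce the same pattern.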
Configuring Simple Fields
For all fields other than Tables, configuring the template consists of selecting a Custom Area and assigning it to a particular field.
For fixed-form configurations, data fields can only be configured using Custom Area selections.
For any field, you can define one or more such Custom Areas, using the (+) button. If you define two or more custom areas for a single field, then at run-time, if the field is defined in the Taxonomy as Single Value, then all values from all the custom areas will be concatenated into a single reported value. If on the other hand the field is defined as Multi Value, then each value from each custom area will be reported individually.
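The Single Value versus Multi Value behavior described above can be sketched as follows. The function name and the space separator are assumptions, since the page does not say how concatenated values are joined.

```python
def report_field_values(area_values, is_multi_value, separator=" "):
    """Combine the text found in each Custom Area of one field.

    area_values: text extracted from each custom area, in definition order.
    is_multi_value: True if the Taxonomy defines the field as Multi Value.
    """
    values = [v for v in area_values if v]  # skip areas where nothing was found
    if is_multi_value:
        return values  # one reported value per custom area
    # Single Value: everything is concatenated into one reported value.
    return [separator.join(values)] if values else []
```

For example, with areas containing "John" and "Smith", a Single Value field would report one value, "John Smith", while a Multi Value field would report two separate values.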
The animation below shows the difference between a Tokens selection and a Custom Area selection:
You can also find out the type of accepted selection for each field by verifying the icon beside each field as shown in the below animation:
If an empty area is selected, the selection is automatically set as Custom area. If text is detected inside the selected area, you are asked to choose the type of the selection between Tokens or Custom area.
Use the Validation Station "selection mode" feature to lock your selection between Tokens and Custom Areas.
Configuring Tables
As mentioned above, there are fields where information can be added only by using tokens (like the Page Matching Info fields) or only by using a custom area (like simple fields). For Table fields, you can:
- define each cell one by one, once the Table Editor is expanded - by adding a Custom Area selection to each cell individually, or
- use the table markup functionality - by marking the table area, drawing row and column separators, and then assigning the thus marked table to the field.
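Conceptually, the table markup option turns one table area plus its row and column separators into a grid of cell areas. The sketch below shows that geometry under assumed inputs: a (left, top, right, bottom) table box and lists of separator coordinates. The names and data format are hypothetical and do not reflect the wizard's internal representation.

```python
def table_cells(table_box, row_separators, column_separators):
    """Split a marked table area into a grid of cell boxes.

    table_box: (left, top, right, bottom) of the marked table area.
    row_separators: y coordinates of the horizontal separator lines.
    column_separators: x coordinates of the vertical separator lines.
    Returns a list of rows, each a list of (left, top, right, bottom) cells.
    """
    left, top, right, bottom = table_box
    ys = [top, *sorted(row_separators), bottom]
    xs = [left, *sorted(column_separators), right]
    return [
        [(xs[c], ys[r], xs[c + 1], ys[r + 1]) for c in range(len(xs) - 1)]
        for r in range(len(ys) - 1)
    ]
```

Two row separators and one column separator therefore yield a 3 x 2 grid, which is equivalent to defining the six Custom Area cells one by one.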
Check the animation below to learn how to use the table markup functionality: