Document Understanding User Guide
Taxonomy overview
The Taxonomy is the metadata that the Document Understanding™ framework considers in each of its steps.
- A Taxonomy is a collection of Document Types.
- A Document Type is the definition of a logical type of document that must be handled by different business processes. Examples of Document Types are Invoices, Medical Records, IRS Forms W-2, and Contracts. Besides a name, group, and category (for easier handling), a document type usually contains a collection of Fields.
- A Field is one piece of information that is expected to be found and captured from a specific Document Type.
As seen above, a Taxonomy is a hierarchical structure that contains the schema of the information the Document Understanding framework will use throughout. Each entity definition (for document types or fields) found in the Taxonomy has a unique ID.
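The hierarchy described above can be sketched as a small data model. This is an illustration in Python, not the UiPath `DocumentTaxonomy` API: the class names, attribute names, and ID strings below are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    # One piece of information expected in a document type (hypothetical shape).
    field_id: str
    name: str
    field_type: str = "Text"


@dataclass
class DocumentType:
    # A logical type of document with a name, group, category, and fields.
    document_type_id: str
    name: str
    group: str
    category: str
    fields: List[Field] = field(default_factory=list)


@dataclass
class Taxonomy:
    # A taxonomy is a collection of document types.
    document_types: List[DocumentType] = field(default_factory=list)

    def all_ids(self) -> List[str]:
        """Collect the IDs of every document type and field definition."""
        ids = []
        for doc_type in self.document_types:
            ids.append(doc_type.document_type_id)
            ids.extend(f.field_id for f in doc_type.fields)
        return ids

    def ids_are_unique(self) -> bool:
        """Check that no two entity definitions share an ID."""
        ids = self.all_ids()
        return len(ids) == len(set(ids))


def build_example_taxonomy() -> Taxonomy:
    # Example IDs are invented for illustration only.
    invoice = DocumentType(
        document_type_id="finance.invoices.invoice",
        name="Invoice",
        group="Finance",
        category="Invoices",
        fields=[
            Field(field_id="finance.invoices.invoice.number", name="Invoice Number"),
            Field(field_id="finance.invoices.invoice.date", name="Invoice Date", field_type="Date"),
        ],
    )
    return Taxonomy(document_types=[invoice])
```

The sketch only mirrors the three levels (taxonomy, document type, field) and the uniqueness of entity IDs; the real object carries more information than is modeled here.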
If you want to classify incoming files into different document types, then the taxonomy should contain the document types you want to specifically treat. These will allow you to configure your document understanding processes based on a uniform data schema: the structure of your taxonomy.
If you want to extract data from certain document types, then the taxonomy will contain the list of fields that you are targeting for automatic data extraction. These will allow the configuration of various extraction methods and rules, again, based on a single source of truth data schema: the structure of your document type.
A Field may have derived parts: formatted information extracted or edited from the underlying textual value found in a document.
Field Type | Allows Multi-Value | Purpose | Derived Parts for Formatting | Additional Information |
---|---|---|---|---|
Text | Yes | Textual information | N/A | N/A |
Number | Yes | Numeric values | | N/A |
Date | Yes | Dates | | Date fields allow for the definition of an Expected Format, which must be an MSDN-compliant date format string (for example, `dd-MM-yyyy` or `MM, dd, yyyy`). This format is used by the Data Extraction Scope activity when trying to parse a date into its constituent day, month, and year parts. |
Name | Yes | Person names | | N/A |
Address | Yes | Addresses | | N/A |
Set | Yes | Define a list of possible values from a predefined set | N/A | A Set field must define the allowed options as values. These are reflected in the Validation Station. |
Boolean | Yes | Yes/No values | N/A | A Boolean field can only have Yes or No as possible values, and is reflected in the Validation Station. |
Table | No | Tabular data | N/A | A Table field contains the definition of the columns. |
Table Column | No | Each cell in the table. | N/A | Table Columns in a Table field are defined as one of the regular fields in the Components list. They cannot be of Table type. |
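
To make the role of a Date field's Expected Format more concrete, here is a rough Python sketch of splitting a date string into day, month, and year parts according to a format such as `dd-MM-yyyy`. It does not show how the Data Extraction Scope activity works internally. The helper names and the token mapping are assumptions, and only the three tokens used in the examples above (`dd`, `MM`, `yyyy`) are handled.

```python
from datetime import datetime

# Assumed, partial mapping from .NET-style date tokens to Python strptime directives.
# Only the tokens that appear in the examples above are covered.
_TOKEN_MAP = [("yyyy", "%Y"), ("MM", "%m"), ("dd", "%d")]


def to_strptime_format(expected_format: str) -> str:
    """Translate a format such as 'dd-MM-yyyy' into a strptime pattern."""
    result = expected_format
    for token, directive in _TOKEN_MAP:
        result = result.replace(token, directive)
    return result


def split_date(text: str, expected_format: str) -> dict:
    """Parse text with the expected format and return its day, month, and year."""
    parsed = datetime.strptime(text, to_strptime_format(expected_format))
    return {"day": parsed.day, "month": parsed.month, "year": parsed.year}
```

For example, `split_date("15-03-2024", "dd-MM-yyyy")` yields day 15, month 3, year 2024, while the same text parsed with a mismatched format raises a `ValueError`. This is why the Expected Format needs to match how dates actually appear in the documents.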
… (`und`) is recommended to be added, to support exceptional cases.
Called on a `DocumentTaxonomy` object, the `Serialize()` method returns a JSON representation of the object, so that it can be stored and retrieved for later use.
The `DocumentTaxonomy.Deserialize(jsonString)` static extension method returns a `DocumentTaxonomy` object, hydrated with the JSON-encoded data passed as a parameter.
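The serialize and deserialize round trip can be illustrated with a minimal Python sketch. The classes and JSON key names below are hypothetical stand-ins, not the real `DocumentTaxonomy` type or the actual schema of `taxonomy.json`. The point is only that serializing and then deserializing should give back an equivalent object.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SketchField:
    field_id: str
    name: str
    field_type: str


@dataclass
class SketchDocumentType:
    document_type_id: str
    name: str
    fields: List[SketchField] = field(default_factory=list)


@dataclass
class SketchTaxonomy:
    document_types: List[SketchDocumentType] = field(default_factory=list)

    def serialize(self) -> str:
        """Return a JSON string representing this object (key names are assumed)."""
        return json.dumps(asdict(self))

    @staticmethod
    def deserialize(json_string: str) -> "SketchTaxonomy":
        """Rebuild a taxonomy object from a JSON string produced by serialize()."""
        raw = json.loads(json_string)
        return SketchTaxonomy(
            document_types=[
                SketchDocumentType(
                    document_type_id=dt["document_type_id"],
                    name=dt["name"],
                    fields=[SketchField(**f) for f in dt["fields"]],
                )
                for dt in raw["document_types"]
            ]
        )


def round_trip(taxonomy: SketchTaxonomy) -> SketchTaxonomy:
    # Store the taxonomy as a string, then hydrate a new object from it.
    return SketchTaxonomy.deserialize(taxonomy.serialize())
```

In a real project the equivalent calls would be `Serialize()` and `DocumentTaxonomy.Deserialize(jsonString)` on the actual taxonomy object; the sketch only mirrors the pattern.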
Once the UiPath.IntelligentOCR.Activities package is installed in your project in UiPath® Studio, a Taxonomy Manager button appears in the main ribbon of Studio's Design tab. Use the Taxonomy Manager wizard to edit your project taxonomy.
The taxonomy is stored in the `taxonomy.json` file.
The file is automatically created when you first open the Taxonomy Manager wizard. You can see the exact location of the file in the Taxonomy Manager by hovering over the button. Alternatively, each time you open the Taxonomy Manager, a pop-up message appears in the upper right corner, informing you of the location of the file. When a project is published from Studio, the taxonomy is published as well, as an artifact of the project.
The `taxonomy.json` file is unique to each project, but you can reuse it by manually copying it to a new project. To do so, create a new project, go to the project folder, and copy the file with the taxonomy of your choice into the correct location (the DocumentProcessing folder).
The taxonomy for document understanding is required as an Object throughout the Document Understanding framework.
The simplest and most convenient way to load your object is by using the Load Taxonomy activity. Once your taxonomy object is loaded, you can use it in all subsequent framework components requiring it.
- If you choose to store your taxonomy in a different location, you can still load it in your project. Once you have the string content of the taxonomy file (say, in a `myTaxonomyContentString` variable), use a simple Assign activity, as follows: `myTaxonomy = DocumentTaxonomy.Deserialize(myTaxonomyContentString)`
- The Taxonomy is a POCO (plain old CLR object), so if your use case demands it, you can edit it even at run time.