- Overview
- Getting started
- Activities
- Insights dashboards
- Document Understanding Process
- Quickstart tutorials
- Framework components
- ML packages
- Overview
- Document Understanding - ML package
- DocumentClassifier - ML package
- ML packages with OCR capabilities
- 1040 - ML package
- 1040 Schedule C - ML package
- 1040 Schedule D - ML package
- 1040 Schedule E - ML package
- 1040x - ML package
- 3949a - ML package
- 4506T - ML package
- 709 - ML package
- 941x - ML package
- 9465 - ML package
- ACORD125 - ML package
- ACORD126 - ML package
- ACORD131 - ML package
- ACORD140 - ML package
- ACORD25 - ML package
- Bank Statements - ML package
- Bills Of Lading - ML package
- Certificate of Incorporation - ML package
- Certificate of Origin - ML package
- Checks - ML package
- Children Product Certificate - ML package
- CMS 1500 - ML package
- EU Declaration of Conformity - ML package
- Financial Statements - ML package
- FM1003 - ML package
- I9 - ML package
- ID Cards - ML package
- Invoices - ML package
- Invoices Australia - ML package
- Invoices China - ML package
- Invoices Hebrew - ML package
- Invoices India - ML package
- Invoices Japan - ML package
- Invoices Shipping - ML package
- Packing Lists - ML package
- Payslips - ML package
- Passports - ML package
- Purchase Orders - ML package
- Receipts - ML Package
- Remittance Advices - ML package
- UB04 - ML package
- Utility Bills - ML package
- Vehicle Titles - ML package
- W2 - ML package
- W9 - ML package
- Other Out-of-the-box ML Packages
- Public endpoints
- Traffic limitations
- OCR Configuration
- Pipelines
- OCR services
- Supported languages
- Deep Learning
- Data and security
- Licensing
Document Understanding User Guide
RegEx Based Extractor
The Regex Based Extractor is the perfect tool for simple use cases, in which, for certain fields, data is always found in a strict, predictable format and context. In other words, if you have a field for which you can define a Regular Expression that is consistently good when matched, then the Regex Based Extractor is a good choice.
The activity comes with a configuration wizard that assists you in defining the regular expressions for the fields you want to target for data extraction in this way.
The activity supports both simple fields as well as table field extraction.
It is recommended to look into other extraction methods, in case there is a high variability of the context and format of the expected values. In such cases, either a Form Extractor or a Machine Learning Extractor may be better suited.
This extractor does not have learning (training) capabilities and requires up-front configuration.
The Regex Based Extractor has two major configurations to be considered:
- the Configure Regular Expressions wizard - which allows you to define regular expressions for certain fields. This wizard also makes available the Regex Editor wizard, which assists you in building your regular expressions.
- the UseVisualAlignment setting - which allows you to control whether the regular expressions configured for an extractor should be applied to the text output of the digitization component, or to a text version in which text lines are organized visually, and words are rearranged on lines based on their visual alignment.