Document Understanding User Guide

Last updated Apr 15, 2025

Data Extraction Training overview

What is Data Extraction Training

Data Extraction Training is a component of the Document Understanding™ Framework that closes the feedback loop for extractors capable of learning from human feedback. This helps those extractors perform better on subsequent documents, to the extent that their own learning capabilities allow.

When Data Extraction Training should be used

You can build Document Understanding processes that do not contain any training component. There are several possible reasons for doing so, including:

the extractors you are using do not support retraining
you don't want to perform retraining as you'd rather have the process always use the same training
you want to update the extractor training offline and you are managing its updates outside of your DU process.

Training your extractors as part of regular process usage is nevertheless beneficial in most cases: the extractors gather their own training data and update themselves by ingesting the human validation information, without requiring any changes to your existing workflows. They become, so to speak, self-learning algorithms that improve over time based on the data that humans have validated as correct.

How to use the Data Extraction Training component

Data Extraction Training is done through the Train Extractors Scope activity. You can train one or more extractors at once, because the scope activity configures and executes one or more extractor training algorithms in a single run.

Data Extraction Training is usually run after Data Extraction Validation: only human-confirmed feedback should be sent back to the extractors for training, to ensure the accuracy of the training data that the algorithms receive.

Data Extraction Training should be run both when the automatically extracted data is correct (no corrections were required) and when a human had to correct it. The algorithms can learn from both cases.
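As a purely conceptual illustration of why both cases are useful, the Python sketch below builds training samples from a validated document. It is not UiPath code: `TrainingSample`, `build_training_samples`, and the field names are hypothetical, and a real extractor trainer decides for itself how to use the feedback it receives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSample:
    """Hypothetical training record: one human-confirmed field value."""
    field: str
    value: str
    was_corrected: bool


def build_training_samples(extracted: dict[str, str],
                           validated: dict[str, str]) -> list[TrainingSample]:
    """Turn every validated field into a sample, corrected or not."""
    samples = []
    for field, confirmed in validated.items():
        # A field counts as corrected when the human-confirmed value
        # differs from what the extractor originally predicted.
        corrected = extracted.get(field) != confirmed
        samples.append(TrainingSample(field, confirmed, corrected))
    return samples


# Hypothetical data: one field confirmed as-is, one field corrected.
extracted = {"InvoiceNumber": "INV-001", "Total": "100.00"}
validated = {"InvoiceNumber": "INV-001", "Total": "108.00"}
samples = build_training_samples(extracted, validated)
```

In this sketch the unchanged field confirms a correct prediction and the corrected field shows the algorithm what it got wrong, so both become training samples.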

You can train extractors that have been used in the Data Extraction component as well as extractors that have not been used for data extraction prediction. The latter approach is used to collect training data and train an extractor from scratch, with the intent of later putting it to use by adding it to Document Understanding workflows.

In short, this is what the Train Extractors Scope does:

Provides all Extractor Trainers (training algorithms) the necessary configurations for them to run.
Accepts one or more extractor trainers.
Allows for document type and field level filtering and taxonomy mapping between the project taxonomy and any internal extractor taxonomies.

You configure the Train Extractors Scope by using the Configure Extractors wizard. You can customize:

which document types and which fields are sent for training to which extractor trainer,
the taxonomy mapping, at document type level and at field level, between the project taxonomy and the extractor's internal taxonomy (if any).
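The filtering and mapping described above can be pictured with a small Python sketch. This is an assumption-laden illustration, not the wizard's actual data model: `TRAINER_CONFIG`, `prepare_for_trainer`, and all document type and field names are hypothetical.

```python
# Hypothetical configuration for ONE extractor trainer.
# Keys are project taxonomy names; values are the extractor's internal names.
# Document types or fields that are not listed are not sent to this trainer.
TRAINER_CONFIG = {
    "Invoices": {
        "internal_type": "invoice",
        "fields": {"InvoiceNumber": "invoice-no", "Total": "total"},
    },
}


def prepare_for_trainer(doc_type, validated_fields, config):
    """Filter and rename validated data for one trainer.

    Returns (internal_type, mapped_fields), or None when the document
    type is not configured for this trainer.
    """
    type_config = config.get(doc_type)
    if type_config is None:
        return None  # document type level filtering
    field_map = type_config["fields"]
    mapped = {
        field_map[name]: value
        for name, value in validated_fields.items()
        if name in field_map  # field level filtering
    }
    return type_config["internal_type"], mapped


result = prepare_for_trainer(
    "Invoices",
    {"InvoiceNumber": "INV-001", "Total": "108.00", "Notes": "n/a"},
    TRAINER_CONFIG,
)
skipped = prepare_for_trainer("Receipts", {"Total": "5.00"}, TRAINER_CONFIG)
```

Here the unmapped `Notes` field is dropped, the remaining fields are renamed to the trainer's internal names, and the unconfigured `Receipts` document type is skipped entirely.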

The Train Extractors Scope also allows you to uniquely identify an Extractor and Extractor Trainer pair of activities, by using the same Framework Alias string in both the Data Extraction Scope and the Train Extractors Scope.
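The pairing idea can be sketched in Python as a lookup on a shared key. This only illustrates the concept of matching by alias; `pair_by_alias` and the alias and activity names are hypothetical and do not reflect how the framework resolves aliases internally.

```python
def pair_by_alias(extractors, trainers):
    """Pair each extractor with the trainer that shares its alias.

    Both arguments map a Framework Alias string to an activity name.
    Aliases present on only one side produce no pair.
    """
    return {
        alias: (extractors[alias], trainers[alias])
        for alias in extractors
        if alias in trainers
    }


# Hypothetical aliases as they might be set in the two scopes.
extractors = {"ml-invoices": "ExtractorA", "regex": "ExtractorB"}
trainers = {"ml-invoices": "TrainerA"}
pairs = pair_by_alias(extractors, trainers)
```

Only `ml-invoices` appears in both scopes, so it is the only alias that yields an extractor and trainer pair.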