Document Understanding User Guide

Last updated Apr 15, 2025

Import documents

The Import data dialog box enables you to easily import new documents to be labeled or revised.

Click the Import button from the management bar.

The dialog box contains the following controls:

Batch name text field - entering a name for your import is mandatory; otherwise, the Browse or drop files section is disabled. A valid name can have up to 24 characters and must not contain special characters.
Make this an evaluation set checkbox - if selected, the dataset is used for evaluation purposes.
Browse or drop files section - click the Browse files to upload control to navigate through your directory, or drag and drop the files inside the frame.
Status section - click (load previous import log) to check the status of the latest import. When you upload data, the Status section shows an overview of your files and prompts you to proceed with the import by clicking YES or to abort it by clicking CANCEL.
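
The batch name rule above can be checked before opening the dialog box. The following is a minimal sketch, not the product's own validation: the guide does not list which characters count as special, so the allowed set below (ASCII letters, digits, spaces, underscores, and hyphens) is an assumption, and is_valid_batch_name is a hypothetical helper name.

```python
import re

MAX_BATCH_NAME_LENGTH = 24  # limit stated in the guide

# Assumption: "special characters" means anything other than ASCII letters,
# digits, spaces, underscores, and hyphens. The guide does not enumerate them.
_ALLOWED = re.compile(r"[A-Za-z0-9 _-]+")


def is_valid_batch_name(name: str) -> bool:
    """Return True if the name is non-empty, at most 24 characters long,
    and made up only of the characters assumed to be allowed."""
    if not name or len(name) > MAX_BATCH_NAME_LENGTH:
        return False
    return _ALLOWED.fullmatch(name) is not None
```

For example, is_valid_batch_name("invoices_2025") passes under these assumptions, while an empty name or a 25-character name does not.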

Import types

There are four types of import supported in Document Manager:

Schema import
Raw documents import (max 2000 pages and 4000 MiB per import)
Document Manager dataset import (max 4000 MiB per import)
Validation Station dataset import (max 2000 pages and 4000 MiB per import)
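
The per-import limits listed above can be checked up front. This is a rough sketch: the dictionary keys and the within_limits helper are hypothetical names, the page count is supplied by the caller (counting pages is out of scope here), and schema import is left out because the guide lists no limit for it.

```python
MIB = 1024 * 1024  # bytes in one MiB

# Limits per import as listed in the guide: (max pages or None, max MiB).
# The key names are illustrative only.
IMPORT_LIMITS = {
    "raw_documents": (2000, 4000),
    "document_manager_dataset": (None, 4000),
    "validation_station_dataset": (2000, 4000),
}


def within_limits(import_type, total_bytes, total_pages=None):
    """Return True if the totals fit the limits for the given import type.

    The page check is skipped when the type has no page limit or when the
    caller does not supply a page count.
    """
    max_pages, max_mib = IMPORT_LIMITS[import_type]
    if total_bytes > max_mib * MIB:
        return False
    if max_pages is not None and total_pages is not None:
        return total_pages <= max_pages
    return True
```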

Schema import

To start a new Document Manager session with the same schema as an existing session, follow these steps:

Click the Export button from the management bar.
In the Export files dialog box, check the Schema option.
Click the Export button inside the dialog box. A .zip file is exported.
Click the Import button from the management bar.
Upload or drag & drop the .zip file directly into the new Document Manager session (do not unzip). In this step, you can also upload a predefined schema.
Click YES in the Status section to proceed with the import. The schema is imported.

Schema import can also be used for multi-value fields.

Important: Multi-value fields are compatible only with models of version 2022.10 or higher.

Raw documents import

The types of documents that can be imported for labeling are: .pdf, .tiff, .png, .jpg.

.zip files are not supported for raw documents import.
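
To sort a folder's files before a raw documents import, a small filter along these lines may help. It is a sketch under assumptions: split_by_support is a hypothetical helper, extensions are compared case-insensitively (the guide does not say whether case matters), and only the four extensions listed above are accepted, so variants such as .tif or .jpeg are treated as unsupported.

```python
from pathlib import Path

# Extensions listed in the guide for raw documents import.
SUPPORTED_EXTENSIONS = {".pdf", ".tiff", ".png", ".jpg"}


def split_by_support(paths):
    """Split paths into (supported, rejected) lists based on file extension.

    Assumption: the comparison is case-insensitive.
    """
    supported, rejected = [], []
    for path in map(Path, paths):
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            rejected.append(path)
    return supported, rejected
```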

OCR settings need to be configured before import.

Follow the steps below:

Click the Import button. The Import data dialog box is displayed.
Provide a batch name in the Batch name field. This enables you to easily filter and find these documents using the Search drop-down later on.
- If you want to use this document batch for training an ML model, leave the Make this an evaluation set checkbox unselected.
- If you want to use this document batch for evaluating an ML model (i.e. measuring its performance), select the Make this an evaluation set checkbox. This ensures the data is ignored by the Training Pipelines.
Upload or drag & drop a file or set of files into the Browse or drop files section.
Click YES. The file or set of files is imported.

Document Manager dataset import

To import a dataset that was previously labeled in another Document Manager session, you need to get the .zip file which was exported originally, and import it directly into the new Document Manager instance.

If your new Document Manager instance is completely empty (no data and no fields defined), then both the documents with labels and the schema are imported.

If your new Document Manager instance already has fields defined, then the newly imported dataset needs to have the same fields, or a subset of those fields. Otherwise, the import is rejected.

If you export a dataset from an Automation Cloud™ environment and then import it into an on-premises deployment, follow these steps:

Unzip the dataset file.
Edit the schema.json file from the archive.
Remove all display_name properties from the JSON file, then save it.
Zip the dataset back, and import it into the on-premises session.
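
The steps above can also be scripted. The following is a minimal sketch, assuming the schema file is valid JSON and that display_name may appear at any nesting depth; strip_display_names and rewrite_dataset are hypothetical helper names, and the schema file name is passed in by the caller exactly as it appears in the archive.

```python
import json
import zipfile


def strip_display_names(node):
    """Recursively drop every "display_name" key from nested dicts and lists."""
    if isinstance(node, dict):
        return {
            key: strip_display_names(value)
            for key, value in node.items()
            if key != "display_name"
        }
    if isinstance(node, list):
        return [strip_display_names(item) for item in node]
    return node


def rewrite_dataset(src_zip, dst_zip, schema_name):
    """Copy src_zip to dst_zip, removing display_name from the schema file.

    schema_name is the schema file's base name inside the archive; all
    other entries are copied unchanged.
    """
    with zipfile.ZipFile(src_zip) as src, \
            zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename.rsplit("/", 1)[-1] == schema_name:
                cleaned = strip_display_names(json.loads(data))
                data = json.dumps(cleaned, indent=2).encode("utf-8")
            dst.writestr(info.filename, data)
```

Writing to a new archive rather than editing in place keeps the original export available if the import is rejected.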

Split large datasets

To import Document Manager datasets that are larger than 1 GB or that contain more than 1500 files, we recommend using this script, which splits the .zip file into multiple .zip files that are each smaller than 1 GB and contain fewer than 1500 files.
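
To illustrate the two thresholds only, here is a sketch of a greedy batching plan. It is not the script referred to above and knows nothing about the internal layout of a Document Manager dataset: each input entry stands for one unit that must stay together (for instance a document plus whatever files belong to it), "1 GB" is read as 10^9 bytes, and plan_batches is a hypothetical helper name.

```python
MAX_BATCH_BYTES = 10**9  # assumption: "1 GB" read as 10^9 bytes
MAX_BATCH_FILES = 1500   # each batch must hold fewer files than this


def plan_batches(files):
    """Group (name, size_in_bytes) pairs into batches that each stay
    below MAX_BATCH_BYTES and below MAX_BATCH_FILES entries.

    Returns a list of lists of names, preserving the input order.
    """
    batches, current, current_bytes = [], [], 0
    for name, size in files:
        if size >= MAX_BATCH_BYTES:
            raise ValueError(f"{name} alone reaches the batch size limit")
        too_many = len(current) + 1 >= MAX_BATCH_FILES
        too_big = current_bytes + size >= MAX_BATCH_BYTES
        if current and (too_many or too_big):
            # Close the current batch before it would hit a threshold.
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```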

Validation Station dataset import

As your RPA workflow processes documents using an existing ML model, some documents may require human validation using the Validation Station activity (available on attended bots or in the browser using Orchestrator Action Center).

The validated data generated in Validation Station can be exported using the Machine Learning Extractor Trainer activity and then used to train ML models through the import described below.

Note: For Validation Station dataset import, it is mandatory to have a schema defined.

Follow the steps below:

Configure the Machine Learning Extractor Trainer to output data into a folder with path <Trainer/Output/Folder> (use any empty folder path).
Run an RPA workflow including Validation Station and Machine Learning Extractor Trainer.
Machine Learning Extractor Trainer creates three subfolders inside the output folder: documents, metadata, and predictions.
Zip the <Trainer/Output/Folder> to obtain a .zip file, for instance TrainerOutputFolder.zip.
Import the .zip file into Document Manager, which detects that the import contains data produced by Machine Learning Extractor Trainer and imports the data accordingly.
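
The zipping step can be scripted as in the sketch below. It checks for the three subfolders named above and then archives the folder's contents. Two points are assumptions rather than documented behavior: zip_trainer_output is a hypothetical helper, and the subfolders are placed at the root of the archive, which may or may not match what your Document Manager version expects.

```python
import shutil
from pathlib import Path

# Subfolders that Machine Learning Extractor Trainer creates, per the guide.
EXPECTED_SUBFOLDERS = ("documents", "metadata", "predictions")


def zip_trainer_output(output_folder, zip_base_name):
    """Zip the Trainer output folder and return the archive path.

    zip_base_name is the target path without the .zip extension; keep it
    outside output_folder so the archive does not include itself.
    """
    folder = Path(output_folder)
    missing = [name for name in EXPECTED_SUBFOLDERS if not (folder / name).is_dir()]
    if missing:
        raise FileNotFoundError(f"missing subfolders: {', '.join(missing)}")
    # Assumption: the three subfolders sit at the root of the archive.
    return shutil.make_archive(str(zip_base_name), "zip", root_dir=str(folder))
```

Calling zip_trainer_output with the Trainer output folder and a base name such as TrainerOutputFolder produces TrainerOutputFolder.zip, ready for import.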