Full pipelines
A Full Pipeline trains a new machine learning model and evaluates the performance of this new model, all in one go. Additionally, a preprocessing step runs before training, allowing data manipulation before training (or retraining) a machine learning model.
To use this pipeline, the package must contain code to process data, train, evaluate, and persist a model (the process_data(), train(), evaluate(), and save() functions in the train.py file). This code, together with a dataset or sub-folder within a dataset, and optionally an evaluation set, produces a new package version, a score (the return value of the evaluate() function for the new version of the model), and any arbitrary outputs the user would like to persist in addition to the score.
Create a new full pipeline as described here. Make sure to provide the following full pipeline specific information:
- In the Pipeline type field, select Full Pipeline run.
- In the Choose input dataset field, select a dataset or folder from which you want to import data for full training. All files in this dataset/folder should be available locally during the runtime of the pipeline at the path stored in the data_directory variable.
- Optionally, in the Choose evaluation dataset field, select a dataset or folder from which you want to import data for evaluation. All files in this dataset/folder should be available locally during the runtime of the pipeline at the path stored in the test_data_directory variable. If no folder is selected, your pipeline is expected to write something to the directory stored in the test_data_directory variable in the process_data function. If you do not select a folder and your process_data function does not write to test_data_directory, then the directory passed to the evaluate function will be empty.
- In the Enter parameters section, enter the environment variables defined and used by your pipeline, if any. The environment variables are:
  - training_data_directory, with default value dataset/training: Defines where the training data is accessible locally for the pipeline. This directory is used as input for the train() function. Most users will never have to override this through the UI and can just write data into os.environ['training_data_directory'] in the process_data function, expecting that the data_directory argument in train(self, data_directory) will be called with os.environ['training_data_directory'].
  - test_data_directory, with default value dataset/test: Defines where the test data is accessible locally for the pipeline. This directory is used as input for the evaluate() function. Most users will never have to override this through the UI and can just write data into os.environ['test_data_directory'] in the process_data function, expecting that the data_directory argument in evaluate(self, data_directory) will be called with os.environ['test_data_directory'].
  - artifacts_directory, with default value artifacts: Defines the path to a directory that will be persisted as ancillary data related to this pipeline. Most, if not all, users will never need to override this through the UI. Anything can be saved during pipeline execution, including images, PDFs, and subfolders. Concretely, any data your code writes in the directory specified by the path os.environ['artifacts_directory'] will be uploaded at the end of the pipeline run and will be viewable from the Pipeline details page.
  - save_training_data, with default value true: If set to true, the training_data_directory folder will be uploaded at the end of the pipeline run as an output of the pipeline under the directory training_data_directory.
  - save_test_data, with default value true: If set to true, the test_data_directory folder will be uploaded at the end of the pipeline run as an output of the pipeline under the directory test_data_directory.
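To make the role of these variables concrete, here is a minimal sketch of what a train.py could look like. It is only an illustration under stated assumptions: the input file name data.csv, its single numeric column y, the 80/20 split, the toy "model" (a mean), the model.json file name, and the _write/_read helpers are all hypothetical and not part of the product. Only the function names and the environment variable names come from the text above.

```python
import csv
import json
import os
import random


class Main:
    """Hypothetical train.py skeleton; the 'model' is just the mean of column y."""

    def __init__(self):
        self.mean = 0.0

    def process_data(self, data_directory):
        # Assumed input: a data.csv file with one numeric column named "y".
        with open(os.path.join(data_directory, "data.csv"), newline="") as f:
            values = [float(row["y"]) for row in csv.DictReader(f)]
        random.Random(0).shuffle(values)  # fixed seed keeps the split repeatable
        cut = int(0.8 * len(values))
        # Write each split where train() and evaluate() will later read it.
        self._write(os.environ["training_data_directory"], "train.csv", values[:cut])
        self._write(os.environ["test_data_directory"], "test.csv", values[cut:])

    def train(self, data_directory):
        values = self._read(os.path.join(data_directory, "train.csv"))
        self.mean = sum(values) / len(values)

    def evaluate(self, data_directory):
        values = self._read(os.path.join(data_directory, "test.csv"))
        # Score: negative mean absolute error, so that higher is better.
        return -sum(abs(v - self.mean) for v in values) / len(values)

    def save(self):
        # Persist the toy model in the working directory (hypothetical file name).
        with open("model.json", "w") as f:
            json.dump({"mean": self.mean}, f)

    @staticmethod
    def _write(directory, name, values):
        os.makedirs(directory, exist_ok=True)
        with open(os.path.join(directory, name), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["y"])
            writer.writerows([v] for v in values)

    @staticmethod
    def _read(path):
        with open(path, newline="") as f:
            return [float(row["y"]) for row in csv.DictReader(f)]
```

Because process_data writes its splits to the two environment-variable paths, the same code works whether or not an evaluation dataset was selected in the UI.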
Depending on whether you select an evaluation dataset, you can create full pipelines as follows:
Watch the following video to learn how to create a full pipeline with the newly trained package version 1.1. Be sure to select the same dataset (in our example, tutorialdataset) both as the input dataset and as the evaluation dataset.
Watch the following video to learn how to create a full pipeline with the newly trained package version 1.1. Select the input dataset, but leave the evaluation dataset unselected.
After the pipeline is executed, its status on the Pipelines page changes to Successful. The Pipeline Details page displays the arbitrary files and folders related to the pipeline run.
- The train() function trains on train.csv, and not on the unaltered contents of the data folder (example1.txt and example2.txt). The process_data function can be used to dynamically split data based on any user-defined parameters.
- The first full pipeline runs evaluations on a directory with example1.txt, example2.txt, and test.csv. The second full pipeline runs evaluations on a directory with only test.csv. This difference comes from not selecting an evaluation set explicitly when creating the second full pipeline run. This way you can have evaluations on new data from UiPath Robots, as well as dynamically split data already in your project.
- Each individual component can write arbitrary artifacts as part of a pipeline (histograms, tensorboard logs, distribution plots, etc.).
- The ML Package zip file is the new package version automatically generated by the training pipeline.
- Artifacts folder, only visible if not empty, is the folder that groups all the artifacts generated by the pipeline. It is saved under the artifacts_directory folder.
- Training folder, only visible if save_training_data was set to true, is a copy of the training_data_directory folder.
- Test folder, only visible if save_test_data was set to true, is a copy of the test_data_directory folder.
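As an illustration of how an artifact could end up in the Artifacts folder, the sketch below writes a small JSON file into the directory named by artifacts_directory. The function name write_histogram_artifact, the file name label_histogram.json, and the idea of saving label counts are hypothetical; only the environment variable and its default value come from the text above.

```python
import json
import os
from collections import Counter


def write_histogram_artifact(labels, filename="label_histogram.json"):
    """Save label counts as a JSON file inside the artifacts directory."""
    # "artifacts" is the default value the text gives for artifacts_directory.
    artifacts_dir = os.environ.get("artifacts_directory", "artifacts")
    os.makedirs(artifacts_dir, exist_ok=True)
    path = os.path.join(artifacts_dir, filename)
    with open(path, "w") as f:
        json.dump(Counter(labels), f, indent=2, sort_keys=True)
    return path
```

A function like this could be called from process_data, train, or evaluate, since each component can write artifacts during the run.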
Here is a conceptually analogous execution of a full pipeline on some package, for example version 1.1, the output of a training pipeline on version 1.0.
- Copy package version 1.1 into ~/mlpackage.
- Make a directory called ./dataset.
- Copy the contents of the input dataset into ./dataset.
- If the user set something in the Choose evaluation dataset field, copy that evaluation dataset and put it in ./dataset/test.
- Set environment variables training_data_directory=./dataset/training and test_data_directory=./dataset/test.
- Execute the following python code:
    import os
    from train import Main
    m = Main()
    m.process_data('./dataset')
    # evaluate the model before training
    m.evaluate(os.environ['test_data_directory'])
    m.train(os.environ['training_data_directory'])
    # evaluate the newly trained model
    m.evaluate(os.environ['test_data_directory'])
- Persist the contents of ~/mlpackage as package version 1.2. Persist artifacts if written, and snapshot data if save_training_data or save_test_data is set to true.
Note: The existence of the environment variables training_data_directory and test_data_directory means that process_data can use these variables to split data dynamically.
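The call order in the steps above can also be sketched as a small driver function. This is a conceptual stand-in, not the platform's actual runner: run_full_pipeline and StubMain are hypothetical names, and the stub only records which functions were called so the order is visible without a real package.

```python
import os


def run_full_pipeline(main, dataset_dir="./dataset"):
    """Call the package functions in the order listed in the steps above."""
    os.environ["training_data_directory"] = os.path.join(dataset_dir, "training")
    os.environ["test_data_directory"] = os.path.join(dataset_dir, "test")
    main.process_data(dataset_dir)
    # Evaluate once before training and once after, on the same test directory.
    score_before = main.evaluate(os.environ["test_data_directory"])
    main.train(os.environ["training_data_directory"])
    score_after = main.evaluate(os.environ["test_data_directory"])
    return score_before, score_after


class StubMain:
    """Stand-in for the Main class of a real train.py; it only records calls."""

    def __init__(self):
        self.calls = []
        self.trained = False

    def process_data(self, data_directory):
        self.calls.append("process_data")

    def train(self, data_directory):
        self.calls.append("train")
        self.trained = True

    def evaluate(self, data_directory):
        self.calls.append("evaluate")
        return 1.0 if self.trained else 0.0


stub = StubMain()
scores = run_full_pipeline(stub)
print(stub.calls, scores)
```

Passing the Main instance in as an argument keeps the driver independent of any particular package, which makes the ordering easy to check locally.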
The _results.json file contains a summary of the pipeline run execution, exposing all inputs/outputs and execution times for a full pipeline.
{
  "parameters": {
    "pipeline": "<Pipeline_name>",
    "inputs": {
      "package": "<Package_name>",
      "version": "<version_number>",
      "input_data": "<storage_directory>",
      "evaluation_data": "<storage_directory>/None",
      "gpu": "True/False"
    },
    "env": {
      "key": "value",
      ...
    }
  },
  "run_summary": {
    "execution_time": <time>, #in seconds
    "start_at": <timestamp>, #in seconds
    "end_at": <timestamp>, #in seconds
    "outputs": {
      "previous_score": <previous_score>, #float
      "current_score": <current_score>, #float
      "training_data": "<training_storage_directory>/None",
      "test_data": "<test_storage_directory>/None",
      "artifacts_data": "<artifacts_storage_directory>",
      "package": "<Package_name>",
      "version": "<new_version>"
    }
  }
}
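Once a run has finished, a file with this shape can be inspected with a few lines of Python. The sketch below assumes the placeholders have been filled in with real values so that the file is valid JSON; every value in sample_results is made up for illustration, and summarize is a hypothetical helper. Only the key names come from the template above.

```python
import json

# Hypothetical, filled-in run summary; all values are invented for this example.
sample_results = """
{
  "parameters": {
    "pipeline": "my_pipeline",
    "inputs": {"package": "my_package", "version": "1.1"}
  },
  "run_summary": {
    "execution_time": 120,
    "outputs": {
      "previous_score": 0.5,
      "current_score": 0.75,
      "package": "my_package",
      "version": "1.2"
    }
  }
}
"""


def summarize(results):
    """Extract the new version, run time, and score change from a run summary."""
    run = results["run_summary"]
    outputs = run["outputs"]
    return {
        "new_version": outputs["version"],
        "seconds": run["execution_time"],
        "score_delta": outputs["current_score"] - outputs["previous_score"],
    }


summary = summarize(json.loads(sample_results))
print(summary)
```

Whether a positive score_delta means an improvement depends on what your evaluate() function returns, so treat the sign convention as specific to your package.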