Training pipelines
To run a training pipeline, the package must contain code to train a model (the train() function in the train.py file) and code to persist a newly trained model (the save() function in the train.py file). These, together with a dataset or sub-folder within a dataset, produce a new package version.
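As a minimal sketch, a train.py satisfying this interface might look like the following. The word-count "model" is a stand-in for illustration only; a real package would train an actual model on the mounted data.

```python
# Sketch of a train.py exposing the Main class with train() and save(),
# the two functions a training pipeline invokes. The "model" here is a toy
# word-frequency table, chosen only to keep the example self-contained.
import os
import pickle


class Main:
    def __init__(self):
        self.model = None

    def train(self, data_directory):
        # data_directory is the path to the mounted input dataset.
        # Walk every file under it and build a trivial word-count "model".
        counts = {}
        for root, _, files in os.walk(data_directory):
            for name in files:
                with open(os.path.join(root, name), encoding="utf-8") as f:
                    for word in f.read().split():
                        counts[word] = counts.get(word, 0) + 1
        self.model = counts

    def save(self):
        # Persist the trained model; whatever is written here becomes part
        # of the new package version produced by the pipeline.
        with open("model.pickle", "wb") as f:
            pickle.dump(self.model, f)
```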
Create a new training pipeline as described here. Make sure to provide the following training-pipeline-specific information:
- In the Pipeline type field, select Training run.
- In the Choose input dataset field, select a dataset or folder from which you want to import data for training. All files in this dataset/folder are available locally during the runtime of the pipeline and are passed to the first argument of your train() function (that is, the path to the mounted data is passed to the data_directory variable in the definition train(self, data_directory)).
- In the Enter parameters section, enter any environment variables defined and used by your pipeline. The environment variables that are set by default are:
  - artifacts_directory, with default value artifacts: This defines the path to a directory that is persisted as ancillary data related to this pipeline. Most users never need to override this through the UI. Anything can be saved during pipeline execution, including images, PDFs, and subfolders. Specifically, any data your code writes in the directory specified by the path os.environ['artifacts_directory'] is uploaded at the end of the pipeline run and is viewable from the Pipeline details page.
  - save_training_data, with default value false: If set to true, the folder chosen in Choose input dataset is uploaded at the end of the pipeline run as an output of the pipeline, under the directory data_directory.
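For instance, code running inside train() can write ancillary files to the artifacts directory so they are uploaded with the run. A small sketch; the "./artifacts" fallback and the report filename are assumptions for running the snippet outside AI Center, where the variable may be unset:

```python
# Sketch: writing an ancillary report into the artifacts directory during
# pipeline execution. The fallback path "./artifacts" is an assumption for
# local runs; inside AI Center the environment variable is set for you.
import os

artifacts_dir = os.environ.get("artifacts_directory", "./artifacts")
os.makedirs(artifacts_dir, exist_ok=True)

# Anything written here (reports, plots, subfolders) is uploaded at the end
# of the pipeline run and viewable from the Pipeline details page.
with open(os.path.join(artifacts_dir, "training-report.txt"), "w") as f:
    f.write("epochs: 10\n")
```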
After the pipeline is executed, a new minor version of the package is available and displayed in the ML Packages > [Package Name] page; in our example, this is package version 1.1. A file written to the artifacts directory during execution, my-training-artifact.txt, appears among the pipeline outputs.
Here is a conceptually analogous execution of a training pipeline on some package, for example version 1.0:

- Copy package version 1.0 into ~/mlpackage.
- Copy the input dataset or the dataset subfolder selected from the UI to ~/mlpackage/data.
- Execute the following Python code:

  from train import Main
  m = Main()
  m.train('./data')
  m.save()

- Persist the contents of ~/mlpackage as package version 1.1. Persist artifacts if written, and snapshot the data if save_training_data is set to true.
The _results.json file contains a summary of the pipeline run execution, exposing all inputs/outputs and execution times for a training pipeline:
{
"parameters": {
"pipeline": "< Pipeline_name >",
"inputs": {
"package": "<Package_name>",
"version": "<version_number>",
"train_data": "<storage_directory>",
"gpu": "True/False"
},
"env": {
"key": "value",
...
}
},
"run_summary": {
"execution_time": <time>, #in seconds
"start_at": <timestamp>, #in seconds
"end_at": <timestamp>, #in seconds
"outputs": {
"train_data": "<test_storage_directory>",
"artifacts_data": "<artifacts_storage_directory>",
"package": "<Package_name>",
"version": "<new_version>"
}
}
}
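These fields can also be read programmatically. A small sketch, assuming the file has been downloaded locally as _results.json; the helper name summarize_results is our own, not part of AI Center:

```python
# Sketch: extracting a few fields from a downloaded _results.json,
# following the schema shown above. summarize_results is a hypothetical
# helper name, not an AI Center API.
import json


def summarize_results(path):
    """Return (pipeline name, execution time in seconds, new package version)."""
    with open(path) as f:
        results = json.load(f)
    return (
        results["parameters"]["pipeline"],
        results["run_summary"]["execution_time"],
        results["run_summary"]["outputs"]["version"],
    )
```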
The outputs of a training pipeline run are:
- The ML Package zip file: the new package version automatically generated by the training pipeline.
- The Artifacts folder: a copy of the artifacts_directory folder, shown only if your code wrote anything to it.
- The data folder: a copy of the input dataset folder, available only if save_training_data was set to true.
Governance in machine learning is something very few companies are equipped to handle. By allowing each model to take a snapshot of the data it was trained on, AI Center enables companies to have data traceability. Training pipelines can be run with save_training_data = true, which takes a snapshot of the data passed in as input. Thereafter, a user can always navigate to the corresponding Pipeline details page to see exactly what data was used at training time.