AI Center User Guide
Last updated Jun 6, 2024

Building ML packages

Data scientists build pre-trained models using Python or an AutoML platform. Those models are then consumed by RPA developers within a workflow.

Structuring ML packages

A package must adhere to a small set of requirements. These requirements are separated into components needed for serving a model and components needed for training a model.

Serving component

A package must provide at least the following:
  • A folder containing a main.py file at the root of this folder.
  • In this file, a class called Main that implements at least two functions:
    • __init__(self): takes no argument and loads your model and/or local data for the model (e.g. word embeddings).
    • predict(self, input): a function to be called at model serving time and returning a String.
  • A file named requirements.txt with dependencies needed to run the model.
Think of the serving component of a package as the model at inference time. At serving time, a container image is created using the provided requirements.txt file, and the predict function is used as the endpoint to the model.
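Putting these requirements together, a serving-only main.py can be very small. The sketch below is illustrative only; the model file name IrisClassifier.sav and the JSON input format are assumptions (they match the full example at the end of this page):
# main.py - minimal serving sketch (model file name and JSON input are assumptions)
import json
import joblib

class Main(object):
   def __init__(self):
      # Load the serialized model shipped inside the package
      self.model = joblib.load('IrisClassifier.sav')

   def predict(self, skill_input):
      # Deserialize the input sent by the Robot, run inference,
      # and return a String as required by the serving contract
      features = json.loads(skill_input)
      return json.dumps(self.model.predict(features).tolist())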

Training and evaluation component

In addition to inference, a package can optionally be used to train a machine learning model. This is done by providing the following:
  • In the same root folder with the main.py file, provide a file named train.py.
  • In this file, provide a class called Main that implements at least four functions. All of the functions below, except __init__, are optional, but omitting them limits the types of pipelines that can be run with the corresponding package (a short process_data sketch follows this list).
    • __init__(self): takes no argument and loads your model and/or data for the model (e.g. word embeddings).
    • train(self, training_directory): takes as input a directory with arbitrarily structured data, runs all the code necessary to train a model. This function is called whenever a training pipeline is executed.
    • evaluate(self, evaluation_directory): takes as input a directory with arbitrarily structured data, runs all the code necessary to evaluate a model, and returns a single score for that evaluation. This function is called whenever an evaluation pipeline is executed.
    • save(self): takes no argument. This function is called after each call of the train function to persist your model.
    • process_data(self, input_directory): takes an input_directory input with arbitrarily structured data. This function is only called whenever a full pipeline is executed. In the execution of a full pipeline, this function can perform arbitrary data transformations and it can split data. Specifically, any data saved to the path pointed to by the environment variable training_data_directory is the input to the train function, and any data saved to the path pointed to by the environment variable evaluation_data_directory is the input to the evaluation function above.
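To make the process_data contract concrete, here is a minimal illustrative sketch; the file name data.csv and the 80/20 split are assumptions, but the two environment variables are the ones described above:
# train.py excerpt - illustrative process_data sketch (data.csv and the split ratio are assumptions)
import os
import pandas as pd
from sklearn.model_selection import train_test_split

class Main(object):
   ...
   def process_data(self, input_directory):
      df = pd.read_csv(os.path.join(input_directory, 'data.csv'))
      train_df, eval_df = train_test_split(df, test_size=0.2)
      # Whatever is written under these two paths becomes the input of the
      # train and evaluate functions, respectively
      train_df.to_csv(os.path.join(os.environ['training_data_directory'], 'train.csv'), index=False)
      eval_df.to_csv(os.path.join(os.environ['evaluation_data_directory'], 'evaluate.csv'), index=False)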

Handling data types

To make AI Center easier to use within an RPA workflow, the package can be denoted to have one of three input types: String, File, and Files (set at package upload time).

String data

This is a sequence of characters. Any data that can be serialized can be used with a package. If used within an RPA workflow, the data can be serialized by the Robot (for example using a custom activity) and sent as a string. The package uploader must have selected json as the package’s input type.
The deserialization of data is done in the predict function. Below are a handful of examples for deserializing data in Python:
# Robot sends raw string to ML Skill Activity
# E.g. skill_input='a customer complaint'
def predict(self, skill_input):
  example = skill_input  # No extra processing

# Robot sends json formatted string to ML Skill Activity
# E.g. skill_input='{"email": "a customer complaint", "date": "mm:dd:yy"}'
def predict(self, skill_input):
  import json
  example = json.loads(skill_input)

# Robot sends json formatted string with number array to ML Skill Activity
# E.g. skill_input='[10, 15, 20]'
def predict(self, skill_input):
  import json
  import numpy as np
  example = np.array(json.loads(skill_input))

# Robot sends json formatted pandas dataframe
# E.g. skill_input='{"row 1":{"col 1":"a","col 2":"b"},
#                    "row 2":{"col 1":"c","col 2":"d"}}'
def predict(self, skill_input):
  import pandas as pd
  example = pd.read_json(skill_input)

File data

This informs the ML Skill Activity making calls to this model to expect a path to a file. Specifically, the activity reads the file from the file system and sends it to the predict function as a serialized byte string. Thus the RPA developer can pass a path to a file, instead of having to read and serialize the file in the workflow itself.
Within the workflow, the input to the activity is just the path to the file. The activity reads the file, serializes it, and sends the file bytes to the predict function. The deserialization of data is also done in the predict function; in the general case, the bytes are simply read into a file-like object, as below:
# ML Package has been uploaded with *file* as input type. The ML Skill Activity
# expects a file path. Any file type can be passed as input and it will be serialized.
def predict(self, skill_input):
  import io
  file_like = io.BytesIO(skill_input)

Reading the serialized bytes as above is equivalent to opening a file with the read binary flag turned on. To test the model locally, read a file as a binary file. The following shows an example of reading an image file and testing it locally:

main.py where model input is an image
class Main(object):
   ...

   def predict(self, skill_input):
      import io
      from PIL import Image
      image = Image.open(io.BytesIO(skill_input))
   ...

if __name__ == '__main__':
   # Test the ML Package locally
   with open('./image-to-test-locally.png', 'rb') as input_file:
      file_bytes = input_file.read()
   m = Main()
   print(m.predict(file_bytes))
The following shows an example of reading a csv file and using a pandas dataframe in the predict function:
main.py where model input is a csv file
class Main(object):
   ...
   def predict(self, skill_input):
      import io
      import pandas as pd
      data_frame = pd.read_csv(io.BytesIO(skill_input))
      ...

if __name__ == '__main__':
   # Test the ML Package locally
   with open('./csv-to-test-locally.csv', 'rb') as input_file:
      file_bytes = input_file.read()
   m = Main()
   print(m.predict(file_bytes))

Files data

This informs AI Center that the ML Skill Activity making calls to this model expects a list of file paths. As in the previous case, the activity reads and serializes each file, and sends a list of byte strings to the predict function.

A list of files can be sent to a skill. Within the workflow, the input to the activity is a string with the paths to the files, separated by commas.

When uploading a package, the data scientist selects list of files as the input type. The data scientist then has to deserialize each of the sent files (as explained above). The input to the predict function is a list of byte strings, where each element in the list is the serialized content of one file.
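The deserialization follows the same pattern as for a single file, applied to each element of the list. A minimal sketch, assuming the files sent are images read with PIL, is shown below:
# ML Package has been uploaded with *list of files* as input type.
# skill_input is a list of byte strings, one per file sent by the activity.
def predict(self, skill_input):
  import io
  from PIL import Image
  # Deserialize each file independently (assumption: the files are images)
  images = [Image.open(io.BytesIO(file_bytes)) for file_bytes in skill_input]
  ...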

Persisting arbitrary data

In train.py, any executed pipeline can persist arbitrary data, called pipeline output. Any data written to the directory path given by the artifacts environment variable is persisted and can be viewed at any point by navigating to the Pipeline Details page. Typically, graphs and statistics of the training/evaluation jobs are saved in the artifacts directory and are accessible from the UI at the end of the pipeline run.
train.py where histogram plots are saved in the ./artifacts directory during Full Pipeline execution
# Full pipeline (using process_data) will automatically split data.csv into 2/3 train.csv (which will be in the directory passed to the train function) and 1/3 test.csv
import os
import pandas as pd
from sklearn.model_selection import train_test_split
class Main(object):
   ...
   def process_data(self, data_directory):
      d = pd.read_csv(os.path.join(data_directory, 'data.csv'))
      d = self.clean_data(d)
      d_train, d_test = train_test_split(d, test_size=0.33, random_state=42)
      d_train.to_csv(os.path.join(data_directory, 'training', 'train.csv'), index=False)
      d_test.to_csv(os.path.join(data_directory, 'test', 'test.csv'), index=False)
      self.save_artifacts(d_train, 'train_hist.png', os.environ["artifacts"])
      self.save_artifacts(d_test, 'test_hist.png', os.environ["artifacts"])
   ...

   def save_artifacts(self, data, file_name, artifact_directory):
      # Save a histogram of the dataframe as a pipeline output (artifact)
      plot = data.hist()
      fig = plot[0][0].get_figure()
      fig.savefig(os.path.join(artifact_directory, file_name))
...

Using TensorFlow

During model development, the TensorFlow graph must be loaded on the same thread as the one used for serving. To do so, the default graph must be used.

Below is an example with the necessary modifications:

import tensorflow as tf
class Main(object):
  def __init__(self):
    self.graph = tf.get_default_graph() # Add this line
    ...
    
  def predict(self, skill_input):
    with self.graph.as_default():
      ...

Information on GPU usage

When GPU is enabled at skill creation time, the skill is deployed on an image with NVIDIA GPU driver 418, CUDA Toolkit 10.0, and the CUDA Deep Neural Network library (cuDNN) 7.6.5 runtime library.
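A quick way to confirm that the deployed skill actually sees the GPU is to log device availability when the model is loaded. The sketch below is illustrative and assumes TensorFlow 1.x, matching the graph-based example above:
import tensorflow as tf
class Main(object):
  def __init__(self):
    # Illustrative check: log whether TensorFlow can see a GPU device
    print('GPU available:', tf.test.is_gpu_available())
    ...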

Examples

Simple ready-to-serve ML model with no training

In this example, the business problem does not require model retraining, thus the package must contain the serialized model IrisClassifier.sav that will be served.
  1. Initial project tree (without main.py and requirements.txt):
    IrisClassifier/
      - IrisClassifier.sav
  2. Sample main.py to be added to the root folder:
    from sklearn.externals import joblib 
    import json
    class Main(object):
       def __init__(self):
          self.model = joblib.load('IrisClassifier.sav')
       def predict(self, X):
          X = json.loads(X)
          result = self.model.predict_proba(X)
          return json.dumps(result.tolist())
  3. Add requirements.txt:
    scikit-learn==0.19.0
    Note: There are some constraints that need to be respected for pip libraries. Make sure that you can install the libraries under the following constraint file:
    itsdangerous<2.1.0
    Jinja2<3.0.5
    Werkzeug<2.1.0
    click<8.0.0
    To test this, you can use the following command in a fresh environment and make sure that all libraries are properly installed:
    pip install -r requirements.txt -c constraints.txt
  4. Final folder structure:
    IrisClassifier/
      - IrisClassifier.sav
      - main.py
      - requirements.txt
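Before uploading the package, you can sanity-check it locally by calling the predict function directly. The script below is a hypothetical test; the feature values are illustrative only:
# Hypothetical local test, run from inside the IrisClassifier/ folder
import json
from main import Main

m = Main()
# One iris sample: sepal length, sepal width, petal length, petal width
print(m.predict(json.dumps([[5.1, 3.5, 1.4, 0.2]])))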

Simple ready-to-serve model with training enabled

In this example, the business problem requires the model to be retrained. Building on the package described above, you may have the following:

  1. Initial project tree (serving-only package):
    IrisClassifier/
      - IrisClassifier.sav
      - main.py
      - requirements.txt
  2. Sample train.py to be added to the root folder:
    import os
    import pandas as pd
    import joblib
    class Main(object):
       def __init__(self):
           self.model_path = './IrisClassifier.sav'
           self.model = joblib.load(self.model_path)

       def train(self, training_directory):
           (X, y) = self.load_data(os.path.join(training_directory, 'train.csv'))
           self.model.fit(X, y)
       def evaluate(self, evaluation_directory):
           (X, y) = self.load_data(os.path.join(evaluation_directory, 'evaluate.csv'))
           return self.model.score(X, y)
       def save(self):
           joblib.dump(self.model, self.model_path)
       def load_data(self, path):
           # The last column in the csv file is the target column for prediction.
           df = pd.read_csv(path)
           X = df.iloc[:, :-1].values
           y = df.iloc[:, -1].values
           return X, y
  3. Edit requirements.txt if needed:
    pandas==1.0.1
    scikit-learn==0.19.0
  4. Final folder (package) structure:
    IrisClassifier/
      - IrisClassifier.sav
      - main.py
      - requirements.txt
      - train.py
    Note: This model can now be served first and, as new data points come into the system via the Robot or Human-in-the-Loop, training and evaluation pipelines can be created leveraging train.py.
