AI Center User Guide
Last updated Jun 6, 2024

Building ML packages

Data scientists build pre-trained models using Python or an AutoML platform. Those models are then consumed by RPA developers within a workflow.

Structuring ML packages

A package must adhere to a small set of requirements. These requirements are separated into components needed for serving a model and components needed for training a model.

Serving component

A package must provide at least the following:
  • A folder containing a main.py file at the root of this folder.
  • In this file, a class called Main that implements at least two functions:
    • __init__(self): takes no argument and loads your model and/or local data for the model (e.g. word embeddings).
    • predict(self, input): a function to be called at model serving time and returning a String.
  • A file named requirements.txt with dependencies needed to run the model.
Think of the serving component of a package as the model at inference time. At serving time, a container image is created using the provided requirements.txt file, and the predict function is used as the endpoint to the model.
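Putting these requirements together, a serving-only main.py can be very small. The sketch below is illustrative only; the model file name IrisClassifier.sav and the JSON input format are assumptions (they match the full example at the end of this page):
# main.py - minimal serving sketch (model file name and JSON input are assumptions)
import json
import joblib

class Main(object):
   def __init__(self):
      # Load the serialized model shipped inside the package
      self.model = joblib.load('IrisClassifier.sav')

   def predict(self, skill_input):
      # Deserialize the input sent by the Robot, run inference,
      # and return a String as required by the serving contract
      features = json.loads(skill_input)
      return json.dumps(self.model.predict(features).tolist())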

Training and evaluation component

In addition to inference, a package can optionally be used to train a machine learning model. This is done by providing the following:
  • In the same root folder with the main.py file, provide a file named train.py.
  • In this file, provide a class called Main that implements at least four functions. All of the functions below, except __init__, are optional, but omitting them limits the types of pipelines that can be run with the corresponding package (a short process_data sketch follows this list).
    • __init__(self): takes no argument and loads your model and/or data for the model (e.g. word embeddings).
    • train(self, training_directory): takes as input a directory with arbitrarily structured data, runs all the code necessary to train a model. This function is called whenever a training pipeline is executed.
    • evaluate(self, evaluation_directory): takes as input a directory with arbitrarily structured data, runs all the code necessary to evaluate a model, and returns a single score for that evaluation. This function is called whenever an evaluation pipeline is executed.
    • save(self): takes no argument. This function is called after each call of the train function to persist your model.
    • process_data(self, input_directory): takes an input_directory input with arbitrarily structured data. This function is only called whenever a full pipeline is executed. In the execution of a full pipeline, this function can perform arbitrary data transformations and it can split data. Specifically, any data saved to the path pointed to by the environment variable training_data_directory is the input to the train function, and any data saved to the path pointed to by the environment variable evaluation_data_directory is the input to the evaluation function above.
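To make the process_data contract concrete, here is a minimal illustrative sketch; the file name data.csv and the 80/20 split are assumptions, but the two environment variables are the ones described above:
# train.py excerpt - illustrative process_data sketch (data.csv and the split ratio are assumptions)
import os
import pandas as pd
from sklearn.model_selection import train_test_split

class Main(object):
   ...
   def process_data(self, input_directory):
      df = pd.read_csv(os.path.join(input_directory, 'data.csv'))
      train_df, eval_df = train_test_split(df, test_size=0.2)
      # Whatever is written under these two paths becomes the input of the
      # train and evaluate functions, respectively
      train_df.to_csv(os.path.join(os.environ['training_data_directory'], 'train.csv'), index=False)
      eval_df.to_csv(os.path.join(os.environ['evaluation_data_directory'], 'evaluate.csv'), index=False)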

Handling data types

To make AI Center easier to use within an RPA workflow, the package can be denoted to have one of three input types: String, File, and Files (set at package upload time).

String data

This is a sequence of characters. Any data that can be serialized can be used with a package. If used within an RPA workflow, the data can be serialized by the Robot (for example using a custom activity) and sent as a string. The package uploader must have selected json as the package’s input type.
The deserialization of data is done in the predict function. Below are a handful of examples for deserializing data in Python:
# Robot sends raw string to ML Skill Activity
# E.g. skill_input='a customer complaint'
def predict(self, skill_input):
  example = skill_input  # No extra processing

# Robot sends json formatted string to ML Skill Activity
# E.g. skill_input='{"email": "a customer complaint", "date": "mm:dd:yy"}'
def predict(self, skill_input):
  import json
  example = json.loads(skill_input)

# Robot sends json formatted string with number array to ML Skill Activity
# E.g. skill_input='[10, 15, 20]'
def predict(self, skill_input):
  import json
  import numpy as np
  example = np.array(json.loads(skill_input))

# Robot sends json formatted pandas dataframe
# E.g. skill_input='{"row 1":{"col 1":"a","col 2":"b"},
#                    "row 2":{"col 1":"c","col 2":"d"}}'
def predict(self, skill_input):
  import pandas as pd
  example = pd.read_json(skill_input)

File data

This informs the ML Skill Activity making calls to this model to expect a path to a file. Specifically, the activity reads the file from the file system and sends it to the predict function as a serialized byte string. Thus the RPA developer can pass a path to a file, instead of having to read and serialize the file in the workflow itself.
Within the workflow, the input to the activity is just the path to the file. The activity reads the file, serializes it, and sends the file bytes to the predict function. The deserialization of data is also done in the predict function; in the general case, the bytes are simply read into a file-like object, as below:
# ML Package has been uploaded with *file* as input type. The ML Skill Activity
# expects a file path. Any file type can be passed as input and it will be serialized.
def predict(self, skill_input):
  import io
  file_like = io.BytesIO(skill_input)

Reading the serialized bytes as above is equivalent to opening a file with the read binary flag turned on. To test the model locally, read a file as a binary file. The following shows an example of reading an image file and testing it locally:

main.py where model input is an image
class Main(object):
   ...

   def predict(self, skill_input):
      import io
      from PIL import Image
      image = Image.open(io.BytesIO(skill_input))
   ...

if __name__ == '__main__':
   # Test the ML Package locally
   with open('./image-to-test-locally.png', 'rb') as input_file:
      file_bytes = input_file.read()
   m = Main()
   print(m.predict(file_bytes))
The following shows an example of reading a csv file and using a pandas dataframe in the predict function:
main.py where model input is a csv file
class Main(object):
   ...
   def predict(self, skill_input):
      import io
      import pandas as pd
      data_frame = pd.read_csv(io.BytesIO(skill_input))
      ...

if __name__ == '__main__':
   # Test the ML Package locally
   with open('./csv-to-test-locally.csv', 'rb') as input_file:
      file_bytes = input_file.read()
   m = Main()
   print(m.predict(file_bytes))

Files data

This informs AI Center that the ML Skill Activity making calls to this model expects a list of file paths. As in the previous case, the activity reads and serializes each file, and sends a list of byte strings to the predict function.

A list of files can be sent to a skill. Within the workflow, the input to the activity is a string with the paths to the files, separated by commas.

When uploading a package, the data scientist selects list of files as the input type. The data scientist then has to deserialize each of the sent files (as explained above). The input to the predict function is a list of byte strings, where each element in the list is the serialized content of one file.
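The deserialization follows the same pattern as for a single file, applied to each element of the list. A minimal sketch, assuming the files sent are images read with PIL, is shown below:
# ML Package has been uploaded with *list of files* as input type.
# skill_input is a list of byte strings, one per file sent by the activity.
def predict(self, skill_input):
  import io
  from PIL import Image
  # Deserialize each file independently (assumption: the files are images)
  images = [Image.open(io.BytesIO(file_bytes)) for file_bytes in skill_input]
  ...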

Persisting arbitrary data

In train.py, any executed pipeline can persist arbitrary data, called pipeline output. Any data written to the directory path given by the artifacts environment variable is persisted and can be viewed at any point by navigating to the Pipeline Details page. Typically, graphs and statistics of the training/evaluation jobs are saved in the artifacts directory and are accessible from the UI at the end of the pipeline run.
train.py where histogram plots are saved in the ./artifacts directory during Full Pipeline execution
# Full pipeline (using process_data) will automatically split data.csv into 2/3 train.csv (which will be in the directory passed to the train function) and 1/3 test.csv
import os
import pandas as pd
from sklearn.model_selection import train_test_split
class Main(object):
   ...
   def process_data(self, data_directory):
      d = pd.read_csv(os.path.join(data_directory, 'data.csv'))
      d = self.clean_data(d)
      d_train, d_test = train_test_split(d, test_size=0.33, random_state=42)
      d_train.to_csv(os.path.join(data_directory, 'training', 'train.csv'), index=False)
      d_test.to_csv(os.path.join(data_directory, 'test', 'test.csv'), index=False)
      self.save_artifacts(d_train, 'train_hist.png', os.environ["artifacts"])
      self.save_artifacts(d_test, 'test_hist.png', os.environ["artifacts"])
   ...

   def save_artifacts(self, data, file_name, artifact_directory):
      # Save a histogram of the dataframe as a pipeline output (artifact)
      plot = data.hist()
      fig = plot[0][0].get_figure()
      fig.savefig(os.path.join(artifact_directory, file_name))
...

Using TensorFlow

During model development, the TensorFlow graph must be loaded on the same thread as the one used for serving. To do so, the default graph must be used.

Below is an example with the necessary modifications:

import tensorflow as tf
class Main(object):
  def __init__(self):
    self.graph = tf.get_default_graph() # Add this line
    ...
    
  def predict(self, skill_input):
    with self.graph.as_default():
      ...

Information on GPU usage

When GPU is enabled at skill creation time, the skill is deployed on an image with NVIDIA GPU driver 418, CUDA Toolkit 10.0, and the CUDA Deep Neural Network library (cuDNN) 7.6.5 runtime library.
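A quick way to confirm that the deployed skill actually sees the GPU is to log device availability when the model is loaded. The sketch below is illustrative and assumes TensorFlow 1.x, matching the graph-based example above:
import tensorflow as tf
class Main(object):
  def __init__(self):
    # Illustrative check: log whether TensorFlow can see a GPU device
    print('GPU available:', tf.test.is_gpu_available())
    ...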

Examples

Simple ready-to-serve ML model with no training

In this example, the business problem does not require model retraining, thus the package must contain the serialized model IrisClassifier.sav that will be served.
  1. Initial project tree (without main.py and requirements.txt):
    IrisClassifier/
      - IrisClassifier.sav
  2. Sample main.py to be added to the root folder:
    from sklearn.externals import joblib 
    import json
    class Main(object):
       def __init__(self):
          self.model = joblib.load('IrisClassifier.sav')
       def predict(self, X):
          X = json.loads(X)
          result = self.model.predict_proba(X)
          return json.dumps(result.tolist())
  3. Add requirements.txt:
    scikit-learn==0.19.0
    Note: There are some constraints that need to be respected for pip libraries. Make sure that you can install the libraries under the following constraint file:
    itsdangerous<2.1.0
    Jinja2<3.0.5
    Werkzeug<2.1.0
    click<8.0.0
    To test this, you can use the following command in a fresh environment and make sure that all libraries are properly installed:
    pip install -r requirements.txt -c constraints.txt
  4. Final folder structure:
    IrisClassifier/
      - IrisClassifier.sav
      - main.py
      - requirements.txt
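Before uploading the package, you can sanity-check it locally by calling the predict function directly. The script below is a hypothetical test; the feature values are illustrative only:
# Hypothetical local test, run from inside the IrisClassifier/ folder
import json
from main import Main

m = Main()
# One iris sample: sepal length, sepal width, petal length, petal width
print(m.predict(json.dumps([[5.1, 3.5, 1.4, 0.2]])))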

Simple ready-to-serve model with training enabled

In this example, the business problem requires the model to be retrained. Building on the package described above, you may have the following:

  1. Initial project tree (serving-only package):
    IrisClassifier/
      - IrisClassifier.sav
      - main.py
      - requirements.txt
  2. Sample train.py to be added to the root folder:
    import os
    import pandas as pd
    import joblib
    class Main(object):
       def __init__(self):
           self.model_path = './IrisClassifier.sav'
           self.model = joblib.load(self.model_path)

       def train(self, training_directory):
           (X, y) = self.load_data(os.path.join(training_directory, 'train.csv'))
           self.model.fit(X, y)
       def evaluate(self, evaluation_directory):
           (X, y) = self.load_data(os.path.join(evaluation_directory, 'evaluate.csv'))
           return self.model.score(X, y)
       def save(self):
           joblib.dump(self.model, self.model_path)
       def load_data(self, path):
           # The last column in the csv file is the target column for prediction.
           df = pd.read_csv(path)
           X = df.iloc[:, :-1].values
           y = df.iloc[:, -1].values
           return X, y
  3. Edit requirements.txt if needed:
    pandas==1.0.1
    scikit-learn==0.19.0
  4. Final folder (package) structure:
    IrisClassifier/
      - IrisClassifier.sav
      - main.py
      - requirements.txt
      - train.py
    Note: This model can now be served first and, as new data points come into the system via the Robot or Human-in-the-Loop, training and evaluation pipelines can be created leveraging train.py.
