Building ML packages
Data scientists build pre-trained models using Python or an AutoML platform. Those models are consumed by RPA developers within a workflow.
A package must adhere to a small set of requirements. These requirements are separated into components needed for serving a model and components needed for training a model.
- A folder containing a main.py file at the root of this folder.
- In this file, a class called Main that implements at least two functions (a minimal skeleton is sketched after this list):
  - __init__(self): takes no argument and loads your model and/or local data for the model (e.g. word embeddings).
  - predict(self, input): a function to be called at model serving time that returns a String.
- A file named requirements.txt with dependencies needed to run the model.

The predict function is used as the endpoint of the model.
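A minimal sketch of such a main.py is shown below; the model file name, loading logic, and echoed output are placeholders rather than part of any specific package:

# main.py -- minimal serving skeleton (placeholder logic)
import json

class Main(object):
    def __init__(self):
        # Load your model and/or local data here, e.g.
        # self.model = joblib.load('model.sav')
        self.model = None

    def predict(self, input):
        # Deserialize the input, run the model, and return a String.
        data = json.loads(input)
        return json.dumps(data)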
- In the same root folder as the main.py file, provide a file named train.py.
- In this file, provide a class called Main that implements at least four functions. All of the functions below, except __init__, are optional, but they limit the types of pipelines that can be run with the corresponding package (a minimal skeleton is sketched after this list):
  - __init__(self): takes no argument and loads your model and/or data for the model (e.g. word embeddings).
  - train(self, training_directory): takes as input a directory with arbitrarily structured data and runs all the code necessary to train a model. This function is called whenever a training pipeline is executed.
  - evaluate(self, evaluation_directory): takes as input a directory with arbitrarily structured data, runs all the code necessary to evaluate a model, and returns a single score for that evaluation. This function is called whenever an evaluation pipeline is executed.
  - save(self): takes no argument. This function is called after each call of the train function to persist your model.
  - process_data(self, input_directory): takes an input_directory with arbitrarily structured data. This function is only called whenever a full pipeline is executed. In the execution of a full pipeline, this function can perform arbitrary data transformations and it can split data. Specifically, any data saved to the path pointed to by the environment variable training_data_directory is the input to the train function, and any data saved to the path pointed to by the environment variable evaluation_data_directory is the input to the evaluate function above.
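A minimal sketch of a train.py implementing these functions is shown below; the function bodies are placeholders, and a complete worked example is given at the end of this page:

# train.py -- minimal training skeleton (placeholder logic)
import os

class Main(object):
    def __init__(self):
        # Load your model and/or data here.
        self.model = None

    def train(self, training_directory):
        # Read arbitrarily structured data from training_directory and fit the model.
        pass

    def evaluate(self, evaluation_directory):
        # Read data from evaluation_directory and return a single score.
        return 0.0

    def save(self):
        # Persist the model; called after each call to train.
        pass

    def process_data(self, input_directory):
        # Transform/split data from input_directory; write training data to the path in
        # os.environ['training_data_directory'] and evaluation data to the path in
        # os.environ['evaluation_data_directory'].
        pass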
To make AI Center easier to use within an RPA workflow, the package can be denoted to have one of three input types: String, File, and Files (set at package upload time).

String data is sent when json is chosen as the package's input type. The data is deserialized in the predict function. Below are a handful of examples for deserializing data in Python:
# Robot sends raw string to ML Skill Activity
# E.g. skill_input='a customer complaint'
def predict(self, skill_input):
    example = skill_input # No extra processing

# Robot sends json formatted string to ML Skill Activity
# E.g. skill_input='{"email": "a customer complaint", "date": "mm:dd:yy"}'
def predict(self, skill_input):
    import json
    example = json.loads(skill_input)

# Robot sends json formatted string with number array to ML Skill Activity
# E.g. skill_input='[10, 15, 20]'
def predict(self, skill_input):
    import json
    import numpy as np
    example = np.array(json.loads(skill_input))

# Robot sends json formatted pandas dataframe
# E.g. skill_input='{"row 1":{"col 1":"a","col 2":"b"},
#                    "row 2":{"col 1":"c","col 2":"d"}}'
def predict(self, skill_input):
    import pandas as pd
    example = pd.read_json(skill_input)
When File is chosen as the package's input type, the file content is sent to the predict function as a serialized byte string. Thus the RPA developer can pass a path to a file, instead of having to read and serialize the file in the workflow itself. The deserialization of the data is also done in the predict function; the general case is just reading the bytes directly into a file-like object, as below:
# ML Package has been uploaded with *file* as input type. The ML Skill Activity
# expects a file path. Any file type can be passed as input and it will be serialized.
def predict(self, skill_input):
    import io
    file_like = io.BytesIO(skill_input)
Reading the serialized bytes as above is equivalent to opening a file with the read binary flag turned on. To test the model locally, read a file as a binary file. The following shows an example of reading an image file and testing it locally:
main.py where model input is an image

class Main(object):
    ...
    def predict(self, skill_input):
        import io
        from PIL import Image
        image = Image.open(io.BytesIO(skill_input))
    ...

if __name__ == '__main__':
    # Test the ML Package locally
    with open('./image-to-test-locally.png', 'rb') as input_file:
        file_bytes = input_file.read()
    m = Main()
    print(m.predict(file_bytes))
Below is a similar example that reads a csv file and uses a pandas dataframe in the predict function:
main.py where model input is a csv file

class Main(object):
    ...
    def predict(self, skill_input):
        import io
        import pandas as pd
        data_frame = pd.read_csv(io.BytesIO(skill_input))
    ...

if __name__ == '__main__':
    # Test the ML Package locally
    with open('./csv-to-test-locally.csv', 'rb') as input_file:
        file_bytes = input_file.read()
    m = Main()
    print(m.predict(file_bytes))
When Files is chosen as the package's input type, a list of files can be sent to a skill. Within the workflow, the input to the activity is a string with paths to the files, separated by a comma. The input to the predict function is then a list of bytes, where each element in the list is the byte string of one file.
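The deserialization can mirror the single-file case. A minimal sketch, assuming the package was uploaded with Files as its input type:

# ML Package has been uploaded with *files* as input type. skill_input is a
# list of byte strings, one element per file sent from the workflow.
def predict(self, skill_input):
    import io
    file_like_objects = [io.BytesIO(file_bytes) for file_bytes in skill_input]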
In train.py, any executed pipeline can persist arbitrary data, called pipeline output. Any data written to the directory path from the environment variable artifacts is persisted and can be surfaced at any point by navigating to the Pipeline Details page. Typically, graphs and statistics of the training/evaluation jobs can be saved in the artifacts directory and are accessible from the UI at the end of the pipeline run.
train.py where some histogram plots are saved in the ./artifacts directory during a Full Pipeline execution

# Full pipeline (using process_data) will automatically split data.csv into 2/3 train.csv
# (which will be in the directory passed to the train function) and 1/3 test.csv
import os
import pandas as pd
from sklearn.model_selection import train_test_split

class Main(object):
    ...
    def process_data(self, data_directory):
        d = pd.read_csv(os.path.join(data_directory, 'data.csv'))
        d = self.clean_data(d)
        d_train, d_test = train_test_split(d, test_size=0.33, random_state=42)
        d_train.to_csv(os.path.join(data_directory, 'training', 'train.csv'), index=False)
        d_test.to_csv(os.path.join(data_directory, 'test', 'test.csv'), index=False)
        self.save_artifacts(d_train, 'train_hist.png', os.environ["artifacts"])
        self.save_artifacts(d_test, 'test_hist.png', os.environ["artifacts"])
    ...
    def save_artifacts(self, data, file_name, artifact_directory):
        # Save a histogram of the dataframe as a png file in the artifacts directory
        plot = data.hist()
        fig = plot[0][0].get_figure()
        fig.savefig(os.path.join(artifact_directory, file_name))
    ...
During model development, the TensorFlow graph must be loaded on the same thread as the one used for serving. To do so, the default graph must be used.
Below is an example with the necessary modifications:
import tensorflow as tf

class Main(object):
    def __init__(self):
        self.graph = tf.get_default_graph() # Add this line
        ...

    def predict(self, skill_input):
        with self.graph.as_default():
            ...
When GPU is enabled at skill creation time, the skill is deployed on an image with NVIDIA GPU driver 418, CUDA Toolkit 10.0, and the CUDA Deep Neural Network Library (cuDNN) 7.6.5 runtime library.
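As a small sanity check (assuming TensorFlow is among the package dependencies), the model can log at load time whether the deployed image exposes a GPU:

import tensorflow as tf

class Main(object):
    def __init__(self):
        # Log whether the serving container sees a GPU device; useful when
        # debugging skills created with GPU enabled.
        print('GPU available:', tf.test.is_gpu_available())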
In this example, the business problem does not require retraining; the package simply contains a pre-trained, serialized model, IrisClassifier.sav, that will be served.
- Initial project tree (without main.py and requirements.txt):

IrisClassifier/
 - IrisClassifier.sav

- Sample main.py to be added to the root folder:

from sklearn.externals import joblib
import json

class Main(object):
    def __init__(self):
        self.model = joblib.load('IrisClassifier.sav')

    def predict(self, X):
        X = json.loads(X)
        result = self.model.predict_proba(X)
        return json.dumps(result.tolist())

- Add requirements.txt:

scikit-learn==0.19.0

Note: There are some constraints that need to be respected on pip libraries. Make sure that you can install libraries under the following constraint files:

itsdangerous<2.1.0
Jinja2<3.0.5
Werkzeug<2.1.0
click<8.0.0

To test this, you can use the following command in a fresh environment and make sure that all libraries are properly installed:

pip install -r requirements.txt -c constraints.txt

- Final folder structure:

IrisClassifier/
 - IrisClassifier.sav
 - main.py
 - requirements.txt
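To sanity-check this package locally before uploading, the predict function can be called directly with a JSON string; the feature values below are hypothetical and depend on how IrisClassifier.sav was trained:

import json
from main import Main

m = Main()
# One sample with four hypothetical iris feature values.
print(m.predict(json.dumps([[5.1, 3.5, 1.4, 0.2]])))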
In this example, the business problem requires the model to be retrained. Building on the package described above, you may have the following:
- Initial project tree (serving-only package):

IrisClassifier/
 - IrisClassifier.sav
 - main.py
 - requirements.txt

- Sample train.py to be added to the root folder:

import os
import joblib
import pandas as pd

class Main(object):
    def __init__(self):
        self.model_path = './IrisClassifier.sav'
        self.model = joblib.load(self.model_path)

    def train(self, training_directory):
        (X, y) = self.load_data(os.path.join(training_directory, 'train.csv'))
        self.model.fit(X, y)

    def evaluate(self, evaluation_directory):
        (X, y) = self.load_data(os.path.join(evaluation_directory, 'evaluate.csv'))
        return self.model.score(X, y)

    def save(self):
        joblib.dump(self.model, self.model_path)

    def load_data(self, path):
        # The last column in the csv file is the target column for prediction.
        df = pd.read_csv(path)
        X = df.iloc[:, :-1].to_numpy()
        y = df.iloc[:, -1].to_numpy()
        return X, y

- Edit requirements.txt if needed:

pandas==1.0.1
scikit-learn==0.19.0

- Final folder (package) structure:

IrisClassifier/
 - IrisClassifier.sav
 - main.py
 - requirements.txt
 - train.py

Note: This model can now be first served and, as new data points come into the system via Robot or Human-in-the-Loop, training and evaluation pipelines can be created leveraging train.py.