English Text Classification
OS Packages > Language Analysis > EnglishTextClassification
This is a generic, retrainable model for English text classification. This ML Package must be trained before it is deployed; if it is deployed without being trained first, the deployment fails with an error stating that the model is not trained.
This model is a deep learning architecture for language classification. It is based on RoBERTa, a self-supervised method for pretraining natural language processing systems, which was open-sourced by Facebook AI Research. A GPU can be used at both serving time and training time, and delivers a roughly 5-10x improvement in speed.
All three types of pipelines (Full Training, Training, and Evaluation) are supported by this package.
For most use cases, no parameters need to be specified; the model uses advanced techniques to find a performant model. In training runs after the first, the model uses incremental learning (that is, the version produced at the end of the previous Training Run is reused).
Reading multiple files
By default, this model reads all files with a .csv or .json extension (recursively) in the provided directory.
CSV File Format:
Each CSV file can have any number of columns, but only two are used by the model. Those columns are specified by the parameters input_column (defaults to “input” if not set) and target_column (defaults to “target” if not set).
For example, a single CSV file may look like this:
input,target
I like this movie,positive
I hated the acting,negative
In the example file above, any type of pipeline can be triggered without adding any extra parameters. In the following example, the columns need to be specified explicitly:
review,sentiment
I like this movie,positive
I hated the acting,negative
Any file that does not have the columns specified by input_column and target_column is skipped. The delimiter used to parse the file can be set with the csv_delimiter parameter. For example, if your file is actually tab-separated, save it with the extension .csv and set csv_delimiter to the tab character.
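The column and delimiter handling described above can be illustrated with a minimal sketch. It is not the package's actual implementation: the function name read_labeled_rows is hypothetical, and only the parameter names input_column, target_column, and csv_delimiter come from this page.

```python
import csv
import io


def read_labeled_rows(text, input_column="input", target_column="target", csv_delimiter=","):
    """Return (input, target) pairs from CSV text, or None if a required column is missing.

    Hypothetical helper for illustration only; extra columns are ignored.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=csv_delimiter)
    header = reader.fieldnames or []
    if input_column not in header or target_column not in header:
        # Mirrors the documented behavior: files without both columns are skipped.
        return None
    return [(row[input_column], row[target_column]) for row in reader]


default_file = "input,target\nI like this movie,positive\nI hated the acting,negative\n"
tabbed_file = "review\tsentiment\nI like this movie\tpositive\n"

# Default column names and delimiter: no extra parameters needed.
print(read_labeled_rows(default_file))

# Wrong column names and delimiter: the file would be skipped.
print(read_labeled_rows(tabbed_file))

# Explicit column names and a tab delimiter.
print(read_labeled_rows(tabbed_file, "review", "sentiment", "\t"))
```

With the default settings, the second call finds neither required column in the header, which is why a tab-separated file needs csv_delimiter set explicitly.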
JSON File Format:
Each JSON file can contain a single data point or a list of data points. That is, each JSON file can have one of two formats.
Single data point in one JSON file:
{
"input": "I like this movie",
"target": "positive"
}
Multiple data points in one JSON file:
[
{
"input": "I like this movie",
"target": "positive"
},
{
"input": "I hated the acting",
"target": "negative"
}
]
As with CSV files, if the input_column and target_column parameters are set, the keys “input” and “target” are replaced by the values of input_column and target_column, respectively.
All valid files (all CSV files and JSON files that conform to the format above) will be coalesced.
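As a rough illustration of the two JSON layouts and of coalescing, the sketch below parses either a single object or a list of objects and concatenates the results. The function names parse_json_points and coalesce are hypothetical, and ignoring data points that lack a required key is an assumption, not documented behavior.

```python
import json


def parse_json_points(text, input_column="input", target_column="target"):
    """Accept a single JSON object or a list of objects; return (input, target) pairs."""
    data = json.loads(text)
    points = data if isinstance(data, list) else [data]
    pairs = []
    for point in points:
        # Assumption: points missing either key are ignored rather than raising an error.
        if isinstance(point, dict) and input_column in point and target_column in point:
            pairs.append((point[input_column], point[target_column]))
    return pairs


def coalesce(json_texts, **columns):
    """Concatenate the data points parsed from several JSON documents."""
    merged = []
    for text in json_texts:
        merged.extend(parse_json_points(text, **columns))
    return merged


single = '{"input": "I like this movie", "target": "positive"}'
many = (
    '[{"input": "I like this movie", "target": "positive"},'
    ' {"input": "I hated the acting", "target": "negative"}]'
)

dataset = coalesce([single, many])
print(len(dataset))  # 3
```

Passing input_column="review" and target_column="sentiment" to coalesce would look up those keys instead of the defaults.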
Reading a single file
In some cases, it may be useful to read a single file (even if your directory has many files). In this case, the csv_name parameter can be used. If set, the pipeline reads only that file. When this parameter is set, two additional parameters are enabled:
- csv_start_index which allows the user to specify the row where to start reading.
- csv_end_index which allows the user to specify the row to end reading.
For example, you may have a large file with 20K rows, but want to quickly see what a Training run on a subset of the data would look like. In this case, you can specify the file name and set csv_end_index to a value much lower than 20K.
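The effect of reading only a row range can be sketched as follows. This page does not state whether the indices are zero-based or whether the end row is included, so the sketch assumes zero-based data rows (header excluded) with an exclusive end; the function name read_row_range is hypothetical.

```python
import csv
import io
from itertools import islice


def read_row_range(text, csv_start_index=0, csv_end_index=None):
    """Return the data rows in [csv_start_index, csv_end_index) as dictionaries.

    Assumption: indices are zero-based, count data rows only, and the end is exclusive.
    """
    reader = csv.DictReader(io.StringIO(text))
    return list(islice(reader, csv_start_index, csv_end_index))


# Build a small stand-in for a large file: 20 rows instead of 20K.
rows = "".join(f"text {i},label {i % 2}\n" for i in range(20))
big_file = "input,target\n" + rows

subset = read_row_range(big_file, csv_end_index=5)
print(len(subset))  # 5
```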
- input_column: change this value to match the name of your dataset's input column (default “input”).
- target_column: change this value to match the name of your dataset's target column (default “target”).
- evaluation_metric: set this value to change the metric returned by the evaluation function and surfaced in the UI. This parameter can be set to one of the following values: “accuracy” (default), “auroc” (area under the ROC curve), “precision”, “recall”, “matthews correlation” (Matthews correlation coefficient), “fscore”.
- csv_name: use this parameter to specify a single CSV file to be read from the dataset.
- csv_start_index: specifies the row at which to start reading. To be used in combination with csv_name.
- csv_end_index: specifies the row at which to stop reading. To be used in combination with csv_name.
The Train function produces the following artifacts:
- train.csv - The data that was used to train the model, saved here for governance and traceability.
- validation.csv - The data that was used to validate the model.
- learning-rate-finder.png - Most users will never need to worry about this. Advanced users may find this helpful (see advanced section).
- train-report.pdf - A report containing summary information of this run. The first section includes all the parameters that were specified by the user. The second section includes statistics about the data (the number of data points for training, validation, and the checksum of each file). The last section includes two plots:
- Loss Plot – This plots the training and validation loss as a function of the number of epochs. The output ML Package version will always be the version that had the minimum validation loss (not the model at the last epoch).
- Metrics Plot – This plots a number of metrics computed on the validation set at the end of each epoch.
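The rule that the output version corresponds to the minimum validation loss, rather than the last epoch, amounts to the selection below. This is a minimal sketch: the function name best_epoch is hypothetical and the loss values are made up for illustration, not results from a real run.

```python
def best_epoch(validation_losses):
    """Return the zero-based index of the epoch with the lowest validation loss."""
    return min(range(len(validation_losses)), key=validation_losses.__getitem__)


# Made-up per-epoch validation losses: the loss rises again after epoch index 2.
val_losses = [0.92, 0.61, 0.48, 0.52, 0.57]

print(best_epoch(val_losses))  # 2
```

In this made-up run the model from the third epoch would be kept, even though training continued for two more epochs.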
The Evaluate function produces two artifacts:
- evaluation.csv - The data that was used to evaluate the model.
- evaluation-report.pdf - A report containing summary information of this run. The first section includes all the parameters that were specified by the user. The second section includes statistics about the data (the number of data points for evaluation and the file checksum). The third section includes statistics of that evaluation (for multi-class, the metrics are weighted). The last section includes a plot of the confusion matrix, and a per-class computation of each of accuracy, precision, recall, and support, as well as their averaged values.
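To make the terms in the evaluation report concrete, the sketch below computes confusion-matrix counts, per-class precision, recall, and support, and support-weighted averages for a multi-class example. It illustrates the general technique only; how the package itself computes these values is not described here, and the function names and labels are hypothetical.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs; missing pairs count as zero."""
    return Counter(zip(y_true, y_pred))


def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, support)} and support-weighted averages."""
    counts = confusion_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    report = {}
    for label in labels:
        true_positives = counts[(label, label)]
        predicted = sum(counts[(t, label)] for t in labels)  # column total
        support = sum(counts[(label, p)] for p in labels)    # row total
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / support if support else 0.0
        report[label] = (precision, recall, support)
    total = len(y_true)
    # Weighted average: each class contributes in proportion to its support.
    weighted_precision = sum(p * s for p, _, s in report.values()) / total
    weighted_recall = sum(r * s for _, r, s in report.values()) / total
    return report, weighted_precision, weighted_recall


y_true = ["positive", "positive", "negative", "negative", "neutral"]
y_pred = ["positive", "negative", "negative", "negative", "neutral"]

report, weighted_precision, weighted_recall = per_class_report(y_true, y_pred)
for label, (precision, recall, support) in report.items():
    print(label, round(precision, 3), round(recall, 3), support)
```

Note that support-weighted recall equals overall accuracy, since each class's recall is weighted by the number of true examples of that class.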
RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, et al.