AI Center
Text Classification
OS Packages > Language Analysis > TextClassification
This is a generic, retrainable model for language classification. This ML Package must be trained before it is deployed; if it is deployed without training first, the deployment fails with an error stating that the model is not trained.
This model is a deep learning architecture for language classification. It is based on BERT, a self-supervised method for pretraining natural language processing systems. A GPU can be used at both training time and serving time, and delivers roughly a 5-10x improvement in speed. BERT was open-sourced by Google.
The main driver of model performance is the quality of the data used for training. The data on which the underlying model was pretrained may also influence performance: it was pretrained on the 100 languages with the largest Wikipedias.
All three types of pipelines (Full Training, Training and Evaluation) are supported by this package.
For most use cases, no parameters need to be specified; the model uses advanced techniques to find a performant model. In trainings after the first, the model uses incremental learning (that is, the version produced at the end of the previous Training Run is used as the starting point).
There are two ways to structure your dataset for this model; you cannot use both at the same time. By default, the model looks for a dataset.csv file in the top-level directory of the dataset. If the file is found, the model uses option 2 (one CSV file); otherwise it tries to use option 1 (folder structure).
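The selection rule above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the package's actual code; the function name detect_dataset_layout and its return values are hypothetical.

```python
from pathlib import Path


def detect_dataset_layout(dataset_dir: str) -> str:
    """Return "csv" if dataset.csv sits at the top level, else "folders".

    Hypothetical helper that mirrors the documented rule; the real
    package may apply additional checks.
    """
    root = Path(dataset_dir)
    # dataset.csv at the top level takes precedence over class folders.
    if (root / "dataset.csv").is_file():
        return "csv"
    # Otherwise expect one sub-folder per class.
    if any(child.is_dir() for child in root.iterdir()):
        return "folders"
    raise ValueError(f"{dataset_dir} has neither dataset.csv nor class folders")
```

Running a check like this before uploading a dataset can help confirm which of the two layouts would be picked up.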
Use a folder structure to separate your classes
Create one folder for each class at the top level of the dataset, and add one text file per data point to the corresponding folder (the folder name is the class, and the file contains only the input text). The dataset structure looks like this:
Dataset
-- folderNamedAsClass1 # the name of the folder must be the name of the class
---- text1Class1.txt # the file can have any name
...
---- textNClass1.txt
-- folderNamedAsClass2
---- text1Class2.txt
...
---- textMClass2.txt
..
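To sanity-check a dataset laid out this way, a short Python sketch like the following can read it back as (text, class) pairs. The helper name load_folder_dataset is hypothetical, and UTF-8 encoding is an assumption; the package's own loader may behave differently.

```python
from pathlib import Path


def load_folder_dataset(dataset_dir: str) -> list[tuple[str, str]]:
    """Read (text, class) pairs from a one-folder-per-class layout.

    Hypothetical helper for inspecting a dataset locally; assumes
    UTF-8 encoded text files.
    """
    samples = []
    for class_dir in sorted(Path(dataset_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files at the top level
        for text_file in sorted(class_dir.iterdir()):
            if text_file.is_file():
                text = text_file.read_text(encoding="utf-8").strip()
                # The folder name is the class label.
                samples.append((text, class_dir.name))
    return samples
```

Counting the samples per class in the returned list is a quick way to spot empty or misnamed class folders.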
Use one CSV file
Group all your data into one CSV file named dataset.csv at the top level of your dataset. The file must have two columns: input (the text) and target (the class). It looks as follows:
input,target
I like this movie,positive
I hated the acting,negative
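A file in this shape can be produced with Python's standard csv module, as in the minimal sketch below. The helper name write_dataset_csv is hypothetical. The csv module quotes any text that contains commas or line breaks; that the package reads such quoted fields correctly is an assumption to verify on a small sample first.

```python
import csv
from pathlib import Path


def write_dataset_csv(samples, dataset_dir):
    """Write (text, class) pairs to <dataset_dir>/dataset.csv.

    Hypothetical helper; the column names follow the documented
    input/target header.
    """
    path = Path(dataset_dir) / "dataset.csv"
    # newline="" lets the csv module control line endings itself.
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["input", "target"])
        writer.writerows(samples)
    return path
```

For example, write_dataset_csv([("I like this movie", "positive"), ("I hated the acting", "negative")], "Dataset") writes the two rows shown above under the input,target header, provided the Dataset folder already exists.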
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.