Training using clusters
User permissions required: ‘View Sources’ AND ‘Review and annotate’.
Once your data is in the platform, the platform will group and display 30 clusters of communications (messages) that it believes share concepts or similar intents. The aim of this part of the training process is to go through each of these clusters and annotate the data presented in each of them.
This process makes training the model easier and faster to begin with, as you can add labels to multiple similar messages at once, as well as adding/removing labels to individual messages as required.
Helpful tips for annotating clusters:
- Don’t spend too long thinking about the name of the label. You can rename a label at any point during the training process.
- Be as specific as possible when naming a label, and keep the taxonomy as flat as possible initially (don’t add too many child labels). You can always rename labels and restructure the taxonomy into a more hierarchical structure later. At this stage, add every label that could apply to a message: deleting labels later is quicker and easier than expanding an existing label.
- Remember it’s often easier to create a more specific, finer-grained taxonomy in the first instance. If it turns out to be too detailed, it’s easy to edit and ‘prune’ your taxonomy later. In practice, this means adding more labels and sub-labels rather than fewer.
- Each message can have multiple labels assigned to it – make sure to apply all relevant labels, otherwise you teach the model not to associate it with the label that you have omitted
- It is better to take the time to carefully annotate now, so that the machine can rapidly and precisely predict labels in future
- Not all clusters will have obviously similar intents and it’s ok to move on if they are all different
When you first create a new Dataset you may find that Discover is empty as shown below. Don’t worry, this is simply because the platform's algorithms are busy working in the background to group your messages into clusters. Depending on the number of messages in the data source this could take up to a few hours to process.
The layout of Discover and an example cluster are shown below. In this example, the platform has detected that these messages share the common theme of the comfort of the hotel beds:
Layout explained:
- A: Toggle button to switch between 'Cluster' and 'Search' mode
- B: Drop-down menu that lets you switch between different clusters
- C: Button to apply a label to all of the messages shown on the page
- D: One of six messages shown from cluster #7 (each cluster contains 12 messages)
- E: Button to apply a label to an individual message
- F: Drop-down menu to adjust the number of messages shown on the page (between 6 and 12)
- G: Buttons to adjust and invert the selection of messages on the page
- H: Button to de-select a message to exclude it from labels added in bulk
As shown in the image below, Discover highlights the parts of a message that contribute most to that message being included in the cluster, helping you identify the common themes more quickly:
Discover highlighting common themes
- The darker lines indicate more important parts of the span (this is explained when you hover over it)
- The lighter coloured lines indicate a medium and slightly weaker contribution to the cluster
1. Review each message in the cluster
2. If you think there is a label that applies to all messages on the page, select ‘Add label’
3. Type in the name of the label and press Enter or click the pin button that appears (you can add several labels at once this way: just type in another label and click the pin button again).
4. Click the ‘Apply labels’ button to assign the label(s) to the messages. The assigned labels will now appear underneath every message on the page.
Alternatively, you can add a label to individual messages by clicking the ‘Add label +’ button highlighted underneath it.
If you want to add a label to a group of messages on the page, but wish to exclude one or several, you can de-select them using the toggle button highlighted (A). You can then invert the selection or de-select / reselect all using the buttons highlighted at the top (B).
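As a rough mental model of bulk annotation with de-selection, the behaviour described above can be sketched in Python. This is illustrative only and is not the platform's implementation or API: the `Message` class, the function names, the example messages and the 'Bed comfort' and 'Breakfast' labels are all hypothetical, and the sketch assumes every message on a page starts out selected.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical stand-in for one message shown on a cluster page."""
    text: str
    selected: bool = True                      # assumption: selected by default
    labels: set = field(default_factory=set)   # a message can hold several labels


def invert_selection(page):
    """Flip the selection state of every message on the page."""
    for message in page:
        message.selected = not message.selected


def apply_labels(page, labels):
    """Add the given labels to the selected messages only."""
    for message in page:
        if message.selected:
            message.labels.update(labels)


page = [
    Message("The bed was very comfortable"),
    Message("Great mattress, slept well"),
    Message("Breakfast was cold"),
]

# De-select the message that does not fit, then label the rest in bulk.
page[2].selected = False
apply_labels(page, {"Bed comfort"})

# Invert the selection to label the previously excluded message.
invert_selection(page)
apply_labels(page, {"Breakfast"})
```

After these steps the first two messages carry only 'Bed comfort' and the third carries only 'Breakfast', which mirrors de-selecting, applying, inverting and applying again in the interface.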
You can view different pages of the same cluster (A) and adjust the number of messages per page (B) using the buttons highlighted. Once the cluster is annotated, you can move onto a new cluster using the drop down list below (C).
The model will present you with 30 clusters and it’s important to work your way through them to create a solid basis for the Explore phase. If a cluster isn’t relevant to you, however, just skip over it.
Discover begins to retrain after a significant amount of training is completed. After 180 messages have been annotated (half of the clusters), Discover will retrain and update the clusters. Don't be put off, just carry on working through them until you've reviewed at least 30.
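The retraining figure follows from the numbers quoted on this page (30 clusters, 12 messages per cluster, retraining after 180 annotated messages). A minimal arithmetic sketch, assuming every message in a cluster is annotated before moving on; the variable names are illustrative only:

```python
# Figures taken from the text above; names are illustrative.
CLUSTERS = 30
MESSAGES_PER_CLUSTER = 12
RETRAIN_THRESHOLD = 180  # annotated messages after which Discover retrains

total_messages = CLUSTERS * MESSAGES_PER_CLUSTER
clusters_to_threshold = RETRAIN_THRESHOLD // MESSAGES_PER_CLUSTER
share_of_clusters = RETRAIN_THRESHOLD / total_messages

print(total_messages)         # 360
print(clusters_to_threshold)  # 15
print(share_of_clusters)      # 0.5
```

In other words, fully annotating 15 of the 30 clusters reaches the 180-message point. If you skip clusters or annotate only some messages in each, you will reach it later.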