Uploading a CSV file into a source
User permissions required: 'Sources admin' AND 'Edit messages'.
To upload data from a CSV file into a data source, navigate to the Sources page (via the admin console, accessed via the cog in the top right of your page) and locate the source you'd like to upload data into.
Click the upload icon in the top right-hand corner of the data source card.
Then click 'Select file' and choose the CSV file you wish to upload.
The selected file must meet the following criteria:
- The file should contain headers on the first line and be delimited by commas or tabs
- A minimum of three columns are required: the message text contents (the message), a timestamp, and a unique ID that identifies the message
- All text fields in your CSV file should be surrounded by double quotes
- The file must be encoded as either UTF-8, UTF-16, or UTF-32 (the platform automatically detects which one)
- The CSV file should be 64 MiB or less. If you have a larger file, you can still upload it by splitting it into multiple files, each less than 64 MiB
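If a file is over the size limit, the splitting described above can be scripted. The following is a minimal sketch, not a platform tool: the function name `split_csv_text` is hypothetical, it assumes the whole file fits in memory as text, and it measures size as UTF-8 bytes. It repeats the header row in every part and quotes every field.

```python
import csv
import io

MAX_BYTES = 64 * 1024 * 1024  # the 64 MiB limit stated in the criteria above


def _render(row):
    """Render one CSV row with every field wrapped in double quotes."""
    buf = io.StringIO(newline="")
    csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(row)
    return buf.getvalue()


def split_csv_text(text, max_bytes=MAX_BYTES):
    """Split CSV text into parts that are each smaller than max_bytes (UTF-8).

    Rows are parsed with the csv module, so quoted fields that contain
    newlines or commas stay intact. Every part starts with the header row.
    A file with a header but no records yields an empty list.
    """
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return []
    header_line = _render(rows[0])
    header_size = len(header_line.encode("utf-8"))

    parts, current, size = [], [header_line], header_size
    for record in rows[1:]:
        line = _render(record)
        line_size = len(line.encode("utf-8"))
        if header_size + line_size >= max_bytes:
            raise ValueError("a single record does not fit within the size limit")
        if size + line_size >= max_bytes:
            # Close the current part and start a new one with the header.
            parts.append("".join(current))
            current, size = [header_line], header_size
        current.append(line)
        size += line_size
    if len(current) > 1:
        parts.append("".join(current))
    return parts
```

Each returned string can be written to its own file (for example with `open(path, "w", encoding="utf-8", newline="")`) and uploaded separately.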
If your file meets the above criteria, you can then configure and upload the messages in the next step:
Select the required columns from each of the dropdown lists containing the column headers detected within the CSV file:
- ID column:
- This must be a column containing a unique ID that can identify the message
- The message IDs can only contain ASCII alphanumeric characters (A-Z a-z 0-9) and punctuation (except /)
- Note: If there are existing messages in the source with the same ID, they will be updated to match the contents of the new file
- Message column:
- This is simply the column that contains the message text that you want to analyse in the platform
- Timestamp column:
- This is the column containing the date and time the message was recorded
- The timestamp format is flexible and will be inferred automatically by the platform
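You may want to check the ID column before uploading. The sketch below is an illustration under stated assumptions, not the platform's own validation: it interprets "punctuation" as Python's `string.punctuation` (ASCII punctuation, so spaces are rejected), treats an empty ID as invalid, and uses the 1024-byte limit quoted in the Troubleshooting table on this page. The function names are hypothetical.

```python
import string
from collections import Counter

# Assumption: "punctuation" means the ASCII punctuation in string.punctuation.
ALLOWED_ID_CHARS = set(string.ascii_letters + string.digits + string.punctuation) - {"/"}
MAX_ID_BYTES = 1024  # limit quoted in the "Id length" error on this page


def message_id_problem(message_id):
    """Return a description of what is wrong with an ID, or None if it looks valid."""
    if not message_id:
        return "empty id"  # assumption: an empty ID cannot identify a message
    bad = sorted(set(message_id) - ALLOWED_ID_CHARS)
    if bad:
        return f"invalid characters: {bad}"
    size = len(message_id.encode("utf-8"))
    if size > MAX_ID_BYTES:
        return f"id has {size} bytes, expected at most {MAX_ID_BYTES}"
    return None


def duplicate_ids(ids):
    """Return IDs that appear more than once, in order of first appearance."""
    return [value for value, count in Counter(ids).items() if count > 1]
```

For example, `message_id_problem("2020/01/31-001")` reports the `/` character, and `duplicate_ids` helps you spot rows that would overwrite each other within the same file.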
If you have data containing subject lines, threads, or participants (typically seen in cases or email threads), you can also upload these additional columns within your CSV file:
- Subject Column
- Choose which column contains the message Subject
- Sender Column
- Choose which column contains the Sender
- To Column
- Choose which column contains the Recipient(s). Multiple recipients should be semicolon separated.
- Cc Column
- Choose which column contains the Cc'd Recipient(s). Multiple recipients should be semicolon separated
- Thread ID Column
- Choose the column that contains the message Thread ID
- A thread ID is what ties together different messages to the same thread
Sender/To/CC format:
- The following conditions in the sender/to/cc fields will trigger errors:
- Exceeds maximum number of recipients (max 2048 recipients per thread)
- Sender or recipient exceeds maximum character limit (max 512 characters per recipient)
- Two or more semicolons are found in a row (e.g. the following is incorrectly formatted: john@email.com ;; beth@email.com)
- Although the platform will strip out any white space before or after a recipient, it will not do any additional data cleansing.
- Example formats you may want your data in (not an exhaustive list):
- Example 1 - Robert Bog <rob.bog@gmail.com>; John Smith <john.smith@gmail.com>
- Example 2 - rob.bog@gmail.com ;john.smith@gmail.com
- Example 3 - rob.bog@gmail.com ; john.smith@gmail.com
- The platform will delimit the different recipients by the semicolon (;)
- Before uploading your data, please ensure the emails are formatted in an appropriate format
- Please note that in a typical threaded use case (e.g. emails), there should only be one sender in each 'sender' cell
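The recipient rules above can be checked before upload with a short script. This is a minimal sketch with a hypothetical function name, `parse_recipients`. It splits on semicolons and strips surrounding white space, as described above. It is stricter than the documented rules in one respect, which is an assumption of the sketch: any empty entry is rejected, including one caused by a leading or trailing semicolon. It does not check the per-thread limit of 2048 recipients.

```python
MAX_RECIPIENT_CHARS = 512  # per-recipient limit stated above


def parse_recipients(cell):
    """Split a sender/to/cc cell into recipients, raising ValueError on bad input."""
    if not cell.strip():
        return []  # an empty cell simply has no recipients
    recipients = [part.strip() for part in cell.split(";")]
    for recipient in recipients:
        if recipient == "":
            # Covers ';;' as well as a leading or trailing ';' (stricter assumption).
            raise ValueError(f"empty recipient in {cell!r}")
        if len(recipient) > MAX_RECIPIENT_CHARS:
            raise ValueError(
                f"recipient has {len(recipient)} characters, "
                f"expected at most {MAX_RECIPIENT_CHARS}"
            )
    return recipients
```

With this sketch, `parse_recipients("rob.bog@gmail.com ;john.smith@gmail.com")` returns the two addresses with the white space removed, while a cell containing `;;` raises an error.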
Timestamp format:
- If your chosen timestamp format is ambiguous for the order of days / months / years (e.g. 01/02/03 10:10), you can suggest the correct interpretation:
- 2nd of January 2003 - None
- 1st of February 2003 - Day first
- 3rd of February 2001 - Year first
- 2nd of March 2001 - Day first + Year first
- To avoid ambiguity, it is recommended to supply timestamps in the RFC 3339 format if possible (e.g. 2020-01-31T12:34:56Z for UTC, or with a timezone offset: 2020-08-31T11:20:00-08:00)
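To see why the ambiguous example matters, the four interpretations listed above can be reproduced with Python's standard `datetime` module. This is only an illustration: the `strptime` format strings are one way to express each option and are not the platform's parser, and `to_rfc3339` is a hypothetical helper for producing unambiguous timestamps before upload.

```python
from datetime import datetime, timedelta, timezone

# One strptime format per option listed above (an illustrative mapping).
FORMATS = {
    "None": "%m/%d/%y %H:%M",
    "Day first": "%d/%m/%y %H:%M",
    "Year first": "%y/%m/%d %H:%M",
    "Day first + Year first": "%y/%d/%m %H:%M",
}


def interpretations(value):
    """Parse an ambiguous timestamp under each of the four options."""
    return {name: datetime.strptime(value, fmt) for name, fmt in FORMATS.items()}


def to_rfc3339(moment):
    """Format a timezone-aware datetime as an RFC 3339 string."""
    if moment.tzinfo is None:
        raise ValueError("timestamp needs a timezone to be unambiguous")
    text = moment.isoformat(timespec="seconds")
    # RFC 3339 allows 'Z' as shorthand for a +00:00 offset.
    return text[:-6] + "Z" if text.endswith("+00:00") else text


for name, parsed in interpretations("01/02/03 10:10").items():
    print(f"{name}: {parsed:%d %B %Y}")
```

Converting your timestamps with something like `to_rfc3339` before upload means none of the day-first or year-first options need to be set.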
Then select the additional user properties you want to upload with the messages. User properties are contextual metadata associated with each message that are filterable in the platform. These are also potentially used by the machine learning models in the platform. There are two types, either string or number:
- String user properties are categorical metadata (typical examples include IDs, countries, counterparties, etc.)
- Number user properties are numeric metadata (typical examples include NPS, email statistics, amounts, etc.)
If your file contains an NPS score as a user property, this must be included as a number property and named 'NPS' only, in order to trigger native NPS charts to load in the platform.
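If you generate the CSV yourself, the quoting and user property guidance can be followed with Python's standard `csv` module. The sketch below is an assumption-laden example: the column names `id`, `message`, `timestamp` and `Country`, the sample row, and the helper `build_csv` are all hypothetical. Only the `NPS` column name comes from the requirement above. `csv.QUOTE_NONNUMERIC` wraps every text field in double quotes and leaves numbers unquoted.

```python
import csv
import io

# Hypothetical column layout: three required columns plus two user properties.
FIELDNAMES = ["id", "message", "timestamp", "Country", "NPS"]

EXAMPLE_ROWS = [
    {
        "id": "msg-001",
        "message": 'The app keeps "crashing", please help',
        "timestamp": "2020-01-31T12:34:56Z",
        "Country": "UK",  # string user property
        "NPS": 9,         # number user property, named 'NPS' only
    },
]


def build_csv(rows, fieldnames=FIELDNAMES):
    """Return CSV text with a header row and all text fields double-quoted."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=fieldnames, quoting=csv.QUOTE_NONNUMERIC)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Write UTF-8, one of the encodings the upload accepts.
    with open("messages.csv", "w", encoding="utf-8", newline="") as handle:
        handle.write(build_csv(EXAMPLE_ROWS))
```

Double quotes inside a message are escaped by doubling them, which is the standard CSV convention that the `csv` module applies automatically.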
Once you've selected all of the user properties, click 'Upload'.
You'll then be prompted to inspect the uploaded messages in a dataset that contains the source you uploaded data into. If the source is not associated with any datasets yet, you can create a new one to check that the upload is as expected.
If you made a mistake when selecting the user properties, you can re-upload the same file, and the platform will use the ID column as the identifier to overwrite the existing messages and properties (this will not affect any labels applied to existing messages).
Troubleshooting
Hopefully your upload will run smoothly, but it's possible that you'll encounter an issue during the upload process and see an error message. We've outlined some of these errors below, along with why they occur, to help you resolve or avoid them.
In the error messages below, {something} maps to contextual information about where the error occurred. Additionally, the way we refer to a position in the file is standardised as:
String | Expands to: |
---|---|
{position} | record {row-number} on line {line-number} column {column-number} (byte {byte-number}) |
Here are some possible error messages users may encounter when uploading CSV files:
Error Kind | Error Message | Description |
---|---|---|
Not Enough Columns | The CSV file only contains {number-columns} column(s), but at least 3 are needed (text, timestamp and id) | The uploaded CSV doesn't contain at least 3 columns or the platform has mis-detected the encoding of the file. |
Invalid Encoding | The file contains invalid characters (encoding detected as {detected-encoding}) | The file is not correctly encoded as UTF-8 / UTF-16 / UTF-32 (the platform automatically detects the format of the file) |
Invalid Header | 'string:ti:er' does not match '(^delimiter|id|message|timestamp|timestamp_default_utc_offset|timestamp_day_first|timestamp_year_first\\Z)|(^(?P<property_type>number|string):(?P<name>\\w(?:[\\w]{0,30}\\w)?)\\Z)' | If a column header is an invalid name for a user property, the platform returns the default message for when the schema of a request is invalid. Check that each column header has a valid format for its purpose. Max length for a column header is 32 alphanumeric characters. |
Unequal Row Lengths | The CSV contains unequal row lengths. Message {position} has {number} fields, but the previous record has {number} fields. | The CSV contains rows with different numbers of cells in them or that are inconsistent with the number of headers. |
Id format | Invalid message id for {record}. Ids can only consist of ASCII alphanumeric characters and punctuation (except '/'). Cell value: {cell-value} | This error occurs when an Id field consists of invalid characters as described in the error message. |
Id length | Id is too long for message {record}. It has {number} bytes, expected at most 1024 | This error occurs when an id field is longer than the maximum allowed length (1024 characters) |
Timestamp Format | Incorrectly formatted timestamp in message {position}: {timestamp-error-message}. Cell value: {cell-value} | This error occurs when a timestamp field could not be parsed. |
Message Length | Message is too long for message {position}. It has {number} bytes, expected at most 65536 | This error occurs when a message field is longer than the maximum allowed length (65536 characters). |
Number Property Format | Incorrectly formatted number in message {position}: {number-error-message}. Cell value: {cell-value} | This error occurs when a number user property field could not be parsed. The platform should allow any format that can reasonably be decoded as a number. |
Property Length | Property is too long for message {position}. It has {number} bytes, expected at most 4096 | This error occurs when a user property field is longer than the maximum allowed length (4096 characters). |
Unknown Error | Unknown CSV error: {underlying-error-message} | The above list is not completely exhaustive - if an unknown error occurs, retry the upload. |
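Several of the errors in the table can be caught locally before you upload. The following is a rough pre-flight sketch, not the platform's validator: the function name `preflight` is hypothetical, the limits are the ones quoted in the table, sizes are measured as UTF-8 bytes, and every column other than the ID, message and timestamp columns is assumed to be a user property.

```python
import csv
import io

# Limits quoted in the error table above.
MAX_ID_BYTES = 1024
MAX_MESSAGE_BYTES = 65536
MAX_PROPERTY_BYTES = 4096


def preflight(text, id_col, message_col, timestamp_col):
    """Return a list of human-readable problems found in CSV text."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, None)
    if header is None:
        return ["file is empty"]

    problems = []
    if len(header) < 3:
        problems.append(f"only {len(header)} column(s), at least 3 are needed")
    for name in (id_col, message_col, timestamp_col):
        if name not in header:
            problems.append(f"missing column: {name}")
    if problems:
        return problems

    limits = {id_col: MAX_ID_BYTES, message_col: MAX_MESSAGE_BYTES}
    for number, row in enumerate(reader, start=1):
        if len(row) != len(header):
            problems.append(
                f"record {number}: {len(row)} fields, but the header has {len(header)}"
            )
            continue
        for name, value in zip(header, row):
            if name == timestamp_col:
                continue  # timestamps are checked by parsing, not by length
            # Assumption: any other column is a user property.
            limit = limits.get(name, MAX_PROPERTY_BYTES)
            size = len(value.encode("utf-8"))
            if size > limit:
                problems.append(
                    f"record {number}, column {name}: {size} bytes, "
                    f"expected at most {limit}"
                )
    return problems
```

An empty list means none of these particular checks failed. It does not guarantee that the upload will succeed, because encoding, ID character and timestamp errors are not covered here.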