Keyword Based Classifier

What Is the Keyword Based Classifier

The Keyword Based Classifier is a simple classifier that searches for repeating string sequences within a given file, in order to perform document classification.

The algorithm is built around the concept of document titles. It starts from the premise that, for document types that have titles, those titles usually vary little from one document to the next.

When classifying a file into a document type, the Keyword Based Classifier:

- finds the best matching string or string collection from its learning data that applies to a taxonomy document type. Confidence is computed based on:
  - how close the match is to the beginning of the document,
  - how many times the match has been confirmed by knowledge workers and reinforced in the learning data;
- reports the highest-scoring document type, together with the underlying matching configuration.
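
The two steps above can be sketched roughly as follows. This is only an illustration, not the classifier's actual implementation: the Entry class, the confidence formula, and the case-insensitive search are all assumptions made for the example, since the exact scoring is not specified here. The sketch only preserves the two stated properties: matches closer to the start of the document score higher, and so do matches with more confirmations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Hypothetical learning-data entry (names are illustrative only)."""
    doc_type: str        # taxonomy document type the keyword is attributed to
    keyword: str         # string to search for in the file text
    confirmations: int   # times knowledge workers confirmed this match


def confidence(text: str, entry: Entry) -> float:
    """Assumed scoring: earlier matches and more confirmations rate higher."""
    pos = text.lower().find(entry.keyword.lower())
    if pos < 0:
        return 0.0  # keyword not present in the text
    position_score = 1.0 - pos / max(len(text), 1)   # 1.0 at the very start
    reinforcement = entry.confirmations / (entry.confirmations + 1)
    return position_score * reinforcement


def classify(text: str, entries: list) -> Optional[tuple]:
    """Return (doc_type, confidence, entry) for the best match, or None."""
    scored = [(confidence(text, e), e) for e in entries]
    best_score, best_entry = max(
        scored, key=lambda pair: pair[0], default=(0.0, None)
    )
    if best_entry is None or best_score == 0.0:
        return None  # nothing in the learning data matched
    return best_entry.doc_type, best_score, best_entry
```

Returning the matching entry alongside the document type mirrors the idea of reporting the underlying matching configuration together with the result.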

The Keyword Based Classifier can work with a single-string entry (one string that counts as one entry in the learning data the Classifier uses) or with a multi-string entry (two or more strings that together form a single entry). For a multi-string entry, the Classifier applies the matching algorithm to each string individually and then computes a simple average of the confidences of the identified matches.

Example

Let's take the example below:

- if an entry contains a single string, for instance, "this is my match", then the Keyword Based Classifier searches for this string and rates it as a potential document type match (according to the document type the string is attributed to).
- if an entry contains three strings, for instance, ["this is a match", "needs more evidence for filtering", "yet another one"], then the Keyword Based Classifier searches for and rates each of the three strings, and then computes a simple average of the matching confidences for reporting.
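
The averaging for the two entries above could look like the sketch below. The per-string confidence values are invented purely for illustration, and the function name entry_confidence is hypothetical. How a string that is not found at all contributes to the average is not stated here, so the sketch simply averages whatever confidences it is given.

```python
from statistics import mean


def entry_confidence(match_confidences: list) -> float:
    """Simple (unweighted) average of the per-string confidences of one entry."""
    return mean(match_confidences)


# Single-string entry: the average of one value is that value.
single = entry_confidence([0.9])

# Three-string entry, with made-up confidences for
# "this is a match", "needs more evidence for filtering", "yet another one".
multi = entry_confidence([0.9, 0.6, 0.3])
```

With a simple average, one weak string lowers the reported confidence of the whole entry, so multi-string entries work best when every string is reliably present in the document type.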

The keyword set can be defined on a single line or across multiple lines. Keywords listed on the same line must all be found in the given input: for example, if x, y, and z are listed on one line, then the search looks for x and y and z.

When multiple lines are defined, the search looks for the keywords listed on the first line, or the second line, or the third, and so on until it has covered all lines and identified the best matches. Identifying more matches from more available keywords increases the confidence score.
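
The AND-within-a-line, OR-across-lines behavior can be sketched as below. The function names and the case-insensitive substring comparison are assumptions made for the example, not the classifier's documented behavior.

```python
def line_matches(text: str, keywords: list) -> bool:
    """A line matches only if every keyword on it is found (logical AND)."""
    lowered = text.lower()
    # An empty keyword line is treated as no match rather than a trivial match.
    return bool(keywords) and all(k.lower() in lowered for k in keywords)


def matching_lines(text: str, keyword_lines: list) -> list:
    """Lines are alternatives (logical OR): collect every line that matches."""
    return [line for line in keyword_lines if line_matches(text, line)]


text = "Purchase Order\nVendor: ACME\nTotal due"
keyword_lines = [
    ["purchase order", "vendor"],   # both present, so the line matches
    ["invoice", "total"],           # "invoice" is missing, so the line fails
    ["total due"],                  # present, so the line matches
]
matched = matching_lines(text, keyword_lines)
```

Here two of the three lines match, which in the scheme described above would give more supporting evidence than a single matching line.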

When to Use

You should consider using this classifier if:

- your files contain one and only one document type each (so no file splitting is required);
- your files contain evidence related to the document type in the first three pages of the file.