Best practices for creating categorization models
Use categorization analysis to assign labels to text. In Pega Platform™, you can categorize text into topics, sentiments, and intents.
The aim of topic detection is to assign a piece of text to one or multiple categories. For example, you can categorize customer queries to make them easier to sort, manage, and respond to.
In sentiment analysis, you identify the underlying affective state that is expressed in a piece of text. Sentiment analysis resembles topic detection in that you classify text into predefined categories, in this case three: positive, neutral, and negative. You can use sentiment analysis for customer feedback analysis, brand monitoring, discourse analysis, and so on.
With intent detection, you can determine the purpose of a piece of text, for example, whether it is a complaint, a request for information, or a threat of churn. For example, the intent of the sentence "How can I get a refund?" is an information request. You can use intent analysis as part of chatbots, where determining the user intent is important for making a relevant response.
Definition of taxonomy
Defining a taxonomy is an essential part of building a topic detection model in Pega Platform. A taxonomy is a CSV file that contains a list of labels (topics) that you want to assign to the text you analyze. Each topic is associated with keywords and key phrases that the model uses to distinguish between topics, as demonstrated in the following simplified example:
|Topics||Keywords and key phrases|
|In-store support||store,office,premises,"shop assistant",clerk|
|Phone support||"voice response","push button",consultant,"automated response"|
Start the development of a topic detection model by creating a list of topics that you want to assign text to and associating each topic with words or phrases that are topic-specific. For more information about the structure of the taxonomy CSV file and rules for constructing a taxonomy for Pega Platform, see Requirements and best practices for creating a taxonomy for rule-based classification analysis.
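The idea of associating topics with keywords and key phrases can be illustrated with a minimal Python sketch. It is not the Pega Platform matching engine: the `TAXONOMY` dictionary, the `match_topics` function, and the whole-word, case-insensitive matching rule are assumptions made for illustration, and only the topics and keywords come from the simplified example above.

```python
import re

# Hypothetical in-memory copy of the simplified taxonomy example.
TAXONOMY = {
    "In-store support": ["store", "office", "premises", "shop assistant", "clerk"],
    "Phone support": ["voice response", "push button", "consultant", "automated response"],
}


def match_topics(text, taxonomy=TAXONOMY):
    """Return the topics whose keywords or key phrases occur in the text.

    Assumption: a keyword matches only as a whole word or phrase,
    ignoring case. A real engine may apply stemming or other rules.
    """
    lowered = text.lower()
    matched = []
    for topic, keywords in taxonomy.items():
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
            if re.search(pattern, lowered):
                matched.append(topic)
                break  # one hit is enough to assign the topic
    return matched
```

Under this whole-word assumption, a plural such as "consultants" does not match the keyword "consultant", which shows why the choice of keywords and their word forms matters when you build a taxonomy.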
Training data for categorization analysis
Machine-learning models require a data set for training and testing purposes. During training, the model learns to classify data into categories by looking at each training record and the associated label. By applying an algorithm to the training data, the model develops rules and patterns for text classification. You can select some of the data as the testing set to determine the accuracy of the model. Test records are excluded from the training process. After the model finishes training, it uses its internal patterns and rules to predict the label of each test record. If the label that the model determined (the machine outcome) matches the label that you assigned to a record (the manual outcome), the model's accuracy increases. Any mismatches between the outcomes decrease the final accuracy score of the model.
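The comparison between manual and machine outcomes can be sketched as a simple accuracy calculation. The function name `accuracy` and the list-based inputs are hypothetical. The sketch assumes that accuracy is the share of test records whose machine outcome equals the manual outcome, which is the conventional definition and may differ in detail from what Pega Platform reports.

```python
def accuracy(manual_outcomes, machine_outcomes):
    """Return the fraction of records where the two outcomes agree.

    manual_outcomes: labels that you assigned to the test records.
    machine_outcomes: labels that the model predicted for the same records.
    """
    if len(manual_outcomes) != len(machine_outcomes):
        raise ValueError("Both lists must describe the same records.")
    if not manual_outcomes:
        raise ValueError("At least one test record is required.")
    matches = sum(
        1 for manual, machine in zip(manual_outcomes, machine_outcomes)
        if manual == machine
    )
    return matches / len(manual_outcomes)
```

For example, if three of four test records receive the label that you assigned, `accuracy` returns 0.75.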
Building the data set for training and testing can be a difficult step in the model development process, depending on the classification problem that you want to solve. For example, you can have an excessive amount of training data or very little of it. An excessive amount of training data can lead to model overfitting. Overfitting happens when a model learns the noise in the training data so that it negatively affects the performance of the model on new data. When there is not enough training data, the model cannot learn all the patterns and rules to correctly distinguish between labels.
Upload the data set for training and testing of a categorization model as a CSV file. That file must have the following columns:
|Content||Result||Type|
|@uPlusGlasses do you have UVA and UVB protection on your own brand sunglasses? Just asked a shop assistant who didn’t know :(||In-store support|| |
|Is it just me or do @uPlusPhones consultants never pick up the phone ??? it’s been 25mins already?||Phone support|| |
|Hey our server has been offline for 2 days now, no phone-support, no reply to high - priority ticket? What's going on???||Phone support||Test|
|Given my recent customer service @uPlusCoffee store I'm not in the least surprised that their profits are down…||In-store support||Test|
The Content column contains the text input for the model to learn from. Depending on the type of documents that you want to analyze with the model (for example, emails or tweets), each training record can be an entire document or just a single sentence. The Result column contains the predefined label. For topic detection, it must be the name of a topic (for example, Phone support); for sentiment analysis, a sentiment value (positive, negative, or neutral); and for intent analysis, one of the user intents that you want to detect (for example, churn, inquire, complain, and so on). The Type column determines whether a record is assigned to the training data set or the testing data set. If the Type column is empty for a record, that record belongs to the training data. When building a model, you can customize the split ratio between the training and testing sets.
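To illustrate how the Content, Result, and Type columns work together, the following Python sketch reads such a file and separates training records from test records. The `SAMPLE_CSV` text and the `split_records` function are hypothetical. The sketch also assumes a header row with exactly these column names and the value Test in the Type column for test records, so check the actual file requirements before relying on it.

```python
import csv
import io

# Hypothetical sample file; the shortened texts are for illustration only.
SAMPLE_CSV = """Content,Result,Type
"Just asked a shop assistant who did not know",In-store support,
"Consultants never pick up the phone",Phone support,
"No phone support and no reply to my ticket",Phone support,Test
"""


def split_records(csv_text):
    """Split records into (training, testing) lists of (content, label) pairs.

    A record with an empty Type column is treated as training data.
    """
    training, testing = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = (row["Content"], row["Result"])
        if (row.get("Type") or "").strip().lower() == "test":
            testing.append(record)
        else:
            training.append(record)
    return training, testing
```

Calling `split_records(SAMPLE_CSV)` yields two training records and one test record for this sample.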
Guidelines for building and refining a training and test data set for categorization analysis
Follow these guidelines to construct a data set for training and testing of a categorization model in Pega Platform:
- Pega recommends that for each category you create at least 20 records in the training sample.
- Distribute records evenly across categories; otherwise, the model will be biased toward the overrepresented categories.
- Pay attention to the names of categories. Spelling mistakes lead to the creation of an excessive number of categories and affect the model's accuracy.
- Inspect your data for categories that have overlapping training data (for example, they often share the same words or phrases). Sometimes the textual data can be classified as belonging to multiple categories, which might lead to high recall but a low precision score for categories that overlap with each other.
- Make sure that the test sample is large enough to produce meaningful statistics. By default, uniform sampling uses 70% of the records for training and 30% for testing.
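Some of these guidelines can be checked before you upload the data set. The following Python sketch counts records for each category, flags categories with fewer than 20 records, and reports pairs of category names that look like spelling variants of each other. The `check_labels` function and the similarity cutoff of 0.85 are assumptions chosen for illustration and are not part of Pega Platform.

```python
from collections import Counter
from difflib import get_close_matches

MIN_RECORDS_PER_CATEGORY = 20  # minimum from the guidelines above


def check_labels(labels, min_records=MIN_RECORDS_PER_CATEGORY, cutoff=0.85):
    """Report record counts, undersized categories, and likely misspellings.

    labels: the Result value of every training record.
    cutoff: hypothetical similarity threshold for near-duplicate names.
    """
    counts = Counter(labels)
    too_small = sorted(name for name, n in counts.items() if n < min_records)
    names = sorted(counts)
    possible_typos = []
    for index, name in enumerate(names):
        # Compare each name only with the names that follow it,
        # so that every pair is reported once.
        close = get_close_matches(name, names[index + 1:], n=3, cutoff=cutoff)
        possible_typos.extend((name, other) for other in close)
    return {
        "counts": dict(counts),
        "too_small": too_small,
        "possible_typos": possible_typos,
    }
```

The counts in the result also show at a glance whether the records are distributed evenly across categories.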
Guidelines for analyzing categorization results
Follow these guidelines to correctly analyze the categorization model that you built:
- While evaluating the model, inspect the manual and machine outcome comparison. This is a good indicator of the categories that the model learned the best. The ratio of category representation in the training data set is proportional to the ratio of correct prediction outcomes in the test set.
- A low F-score can indicate that some of your categories overlap or there is a lack of features that make categories clearly distinguishable in your training set. The highest F-score value is 1 and it can be achieved only if the test data set is the same as the training data set.
- The F-score for keyword match is -1.
- Inspect the performance metrics of your model to discover whether the recall and precision scores are similar across all categories. A sign of overfitting is that some categories have extremely high precision and recall while others perform extremely poorly. A stable model has similar precision and recall scores across all categories.
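The per-category comparison described in these guidelines can be sketched in Python by using the conventional definitions of precision, recall, and the F-score (F1, the harmonic mean of precision and recall). The `per_category_metrics` function is hypothetical, and the assumption that Pega Platform reports exactly these formulas is not confirmed here.

```python
def per_category_metrics(manual_outcomes, machine_outcomes):
    """Return precision, recall, and F-score for every category.

    manual_outcomes: labels that you assigned to the test records.
    machine_outcomes: labels that the model predicted for the same records.
    """
    pairs = list(zip(manual_outcomes, machine_outcomes))
    categories = sorted(set(manual_outcomes) | set(machine_outcomes))
    metrics = {}
    for category in categories:
        true_pos = sum(1 for m, p in pairs if m == category and p == category)
        false_pos = sum(1 for m, p in pairs if m != category and p == category)
        false_neg = sum(1 for m, p in pairs if m == category and p != category)
        predicted = true_pos + false_pos
        actual = true_pos + false_neg
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        total = precision + recall
        f_score = 2 * precision * recall / total if total else 0.0
        metrics[category] = {
            "precision": precision,
            "recall": recall,
            "f_score": f_score,
        }
    return metrics
```

Large gaps between categories in this output point to the overlap or imbalance problems that the guidelines describe.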
Guidelines for uploading a categorization model as part of a Text Analyzer rule
When you first deploy a categorization model as part of a Text Analyzer rule, Pega recommends combining the rule-based (taxonomy) and machine-learning (model) approaches. To do so, set the confidence score threshold for categorization to 0.5 and select the option to fall back to rule-based categorization if the confidence score threshold is not met. With these settings, a topic for a piece of text remains undetected when the confidence score for every topic is below 0.5 and the Text Analyzer finds no match in the corresponding taxonomy.
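The hybrid decision logic can be sketched in Python as follows. This is an illustration of the idea and not the Text Analyzer implementation: the `categorize` function, the shape of `model_scores`, and the `rule_based_match` callable are hypothetical, and treating a score equal to the threshold as meeting it is an assumption.

```python
CONFIDENCE_THRESHOLD = 0.5  # recommended starting value from the text above


def categorize(text, model_scores, rule_based_match,
               threshold=CONFIDENCE_THRESHOLD):
    """Return detected topics, preferring the machine-learning model.

    model_scores: dict that maps each topic to the model's confidence score.
    rule_based_match: callable that takes the text and returns the topics
        matched in the taxonomy (hypothetical stand-in for the rules).
    An empty result means that no topic was detected.
    """
    ranked = sorted(model_scores.items(), key=lambda item: item[1], reverse=True)
    # Assumption: a score equal to the threshold counts as meeting it.
    confident = [topic for topic, score in ranked if score >= threshold]
    if confident:
        return confident
    # No topic met the threshold, so fall back to the taxonomy.
    return rule_based_match(text)
```

For example, with a model score of 0.3 for Phone support and a taxonomy match on In-store support, the sketch returns In-store support from the fallback.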
For more information, see Configuring categorization settings.
Setting the hybrid approach for categorization analysis in a Text Analyzer