This content has been archived and is no longer being maintained.

Training data size considerations for building text analytics models

Depending on the algorithm that you use, the size of the training data can significantly affect how long a text analytics model takes to build. For example, a model that is trained on a very large data set (for example, 10,000 to 20,000 records per category) can take more than an hour to generate.

Algorithms

Pega Platform® provides a set of algorithms that you can use to train your classifier for sentiment and classification analysis. Build times vary depending on the algorithm that you choose. For example, Naive Bayes analyzes training data the fastest, but other algorithms provide more accurate predictions.

For more information about building text analytics models in Pega Platform, see Creating text classification analysis models and Creating sentiment analysis models.

Naive Bayes

Naive Bayes is a simple but effective algorithm for predictive modeling that assumes that training features are independent of one another. Even though this assumption rarely holds for text data, the classifier can still be very effective. The main advantage of choosing Naive Bayes over the other available algorithms is that it provides the fastest build time for large training data sets. The Naive Bayes algorithm is available for classification analysis in Pega Platform.
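To make the idea concrete, the following is a minimal from-scratch sketch of a multinomial Naive Bayes text classifier with add-one smoothing. This is purely illustrative; it is not Pega Platform's implementation, and the class name and training sentences are invented for the example.

```python
import math
from collections import Counter, defaultdict

# Illustrative multinomial Naive Bayes for text (not Pega's implementation).
class NaiveBayesText:
    def fit(self, docs, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)   # per-label word frequencies
        self.vocab = set()
        for doc, label in zip(docs, labels):
            for word in doc.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        self.total = sum(self.label_counts.values())

    def predict(self, doc):
        words = doc.lower().split()
        best, best_score = None, float("-inf")
        for label in self.label_counts:
            # log prior + log likelihoods with add-one (Laplace) smoothing;
            # the "naive" step: word probabilities are simply multiplied
            score = math.log(self.label_counts[label] / self.total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

clf = NaiveBayesText()
clf.fit(["great service fast reply", "love this product",
         "slow response bad support", "terrible experience"],
        ["positive", "positive", "negative", "negative"])
print(clf.predict("fast great support"))   # → positive
```

Because training is just counting word frequencies, a single pass over the data suffices, which is why Naive Bayes builds fastest on large training sets.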

Maximum Entropy

The Maximum Entropy (MaxEnt) classifier is a probabilistic classifier that belongs to the class of exponential models. Unlike Naive Bayes, MaxEnt does not assume that the features are conditionally independent of one another. Instead, the classifier iterates over the training data multiple times and selects the model that has the largest entropy among those consistent with the data. MaxEnt can solve many text classification problems better than Naive Bayes; however, a MaxEnt classifier takes more time to build than a Naive Bayes classifier. The MaxEnt algorithm is available for classification and sentiment analysis in Pega Platform.
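A binary MaxEnt model is equivalent to logistic regression trained iteratively on the data, which illustrates why it costs more than Naive Bayes: instead of one counting pass, it makes many gradient passes. The following sketch assumes bag-of-words features and plain gradient ascent; it is a simplification, not Pega's implementation, and the function names and data are invented.

```python
import math
from collections import defaultdict

def featurize(doc):
    # Bag-of-words feature vector as a sparse dict
    f = defaultdict(float)
    for w in doc.lower().split():
        f[w] += 1.0
    return f

def train_maxent(docs, labels, epochs=200, lr=0.5):
    weights = defaultdict(float)
    for _ in range(epochs):                  # iterate repeatedly over the data
        for doc, y in zip(docs, labels):     # y is 1 or 0
            f = featurize(doc)
            z = sum(weights[w] * v for w, v in f.items())
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of class 1
            for w, v in f.items():           # gradient step toward observed label
                weights[w] += lr * (y - p) * v
    return weights

def predict(weights, doc):
    z = sum(weights[w] * v for w, v in featurize(doc).items())
    return 1 if z > 0 else 0

w = train_maxent(["great fast service", "love it", "bad slow support", "terrible"],
                 [1, 1, 0, 0])
print(predict(w, "great fast reply"))   # → 1
```

The repeated passes over the training data in `train_maxent` are what the article refers to: each additional epoch improves the fit but adds to the build time.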

Support Vector Machine

Support Vector Machine (SVM) is a classifier that represents training data as points in an n-dimensional space and separates the classes with a hyperplane. SVM is used to build supervised, linear, and nonprobabilistic classifiers. SVM performs best with large amounts of training data; however, classifiers based on SVM are the slowest to build. The SVM algorithm is available for classification analysis in Pega Platform.
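As a rough illustration of the separating-hyperplane idea, the sketch below trains a linear SVM by sub-gradient descent on the hinge loss with L2 regularization (a Pegasos-style update). Production SVM solvers work differently and much more carefully; this is only a sketch with invented data, not Pega's implementation.

```python
from collections import defaultdict

def featurize(doc):
    # Bag-of-words feature vector as a sparse dict
    f = defaultdict(float)
    for w in doc.lower().split():
        f[w] += 1.0
    return f

def train_svm(docs, labels, epochs=100, lr=0.1, lam=0.01):
    w = defaultdict(float)
    for _ in range(epochs):
        for doc, y in zip(docs, labels):     # y is +1 or -1
            f = featurize(doc)
            margin = y * sum(w[k] * v for k, v in f.items())
            for k in list(w):                # L2 regularization shrink
                w[k] *= (1 - lr * lam)
            if margin < 1:                   # point inside the margin:
                for k, v in f.items():       # hinge-loss sub-gradient step
                    w[k] += lr * y * v
    return w

def predict(w, doc):
    s = sum(w[k] * v for k, v in featurize(doc).items())
    return 1 if s >= 0 else -1               # which side of the hyperplane?

w = train_svm(["great fast service", "love it", "bad slow support", "terrible"],
              [1, 1, -1, -1])
print(predict(w, "great fast reply"))   # → 1
```

The learned weight vector defines the hyperplane; prediction is just the sign of the dot product, so classification is fast even though training is the slowest of the three algorithms.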

Model performance

The values in the following table were derived by testing Naive Bayes, SVM, and MaxEnt algorithms in Pega Platform against training data of various sizes. The following characteristics were common to all training data:

  • Number of categories in training data – 10
  • Average character count per row – 233
  • Train and test data split ratio – 60%/40%
  • Heap size – 8 gigabytes

You need at least 100 records per category to build a text analytics model. If the classification taxonomy that you use is hierarchical, each leaf node must have at least 100 records.
| Training records per category | Total number of rows | File size (MB) | Naive Bayes builds? | MaxEnt builds? | SVM builds? | Building time (minutes) | Testing time (minutes) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1,000 | 10,000 | 1 | Yes | Yes | Yes | SVM: 20 | SVM: 5 |
| 10,000 | 100,000 | 13 | Yes | Yes | No | Naive Bayes: 0.84; MaxEnt: 6.5 | Naive Bayes: 10; MaxEnt: 10 |
| 20,000 | 200,000 | 26 | Yes | No | No | Naive Bayes: 35 | Naive Bayes: 22 |
| 20,000 | 200,000 | 26 | No | Yes | No | MaxEnt: 61 | MaxEnt: 34 |

The preceding table shows that you can train multiple algorithms simultaneously. However, if the combined training data grows beyond a certain size, the build might fail.
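The data-preparation guidance above (at least 100 records per category, and a 60%/40% train/test split) can be sketched as a small validation helper. The function name and data are hypothetical; this only illustrates the checks, not any Pega Platform API.

```python
import random
from collections import Counter

# Minimum records per category required to build a model (per the guidance above)
MIN_RECORDS_PER_CATEGORY = 100

def validate_and_split(records, labels, train_ratio=0.6, seed=42):
    # Reject categories that fall below the minimum record count
    counts = Counter(labels)
    too_small = [c for c, n in counts.items() if n < MIN_RECORDS_PER_CATEGORY]
    if too_small:
        raise ValueError(f"Categories below {MIN_RECORDS_PER_CATEGORY} records: {too_small}")
    # Shuffle deterministically, then split 60% train / 40% test
    pairs = list(zip(records, labels))
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_ratio)
    return pairs[:cut], pairs[cut:]

# 10 categories x 1,000 records each, matching the first table row
data = [(f"record {i}", f"category-{i % 10}") for i in range(10_000)]
train, test = validate_and_split([r for r, _ in data], [l for _, l in data])
print(len(train), len(test))   # → 6000 4000
```

A quick check like this before training helps avoid starting a long build that cannot succeed because a leaf category is under-populated.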

Published November 9, 2017 — Updated August 29, 2018
