This content has been archived and is no longer being updated. Links may not function; however, this content may be relevant to outdated versions of the product.

Training data size considerations for building text analytics models

Depending on the type of algorithm that you use, the size of the training data for a text analytics model affects the build time. For example, building a model on a very large training data set (for example, 10,000 to 20,000 records per category) can take more than an hour.


Pega Platform® provides a set of algorithms that you can use to train your classifier for sentiment and classification analysis. Build times vary by algorithm. For example, Naive Bayes analyzes training data sets the fastest, while other algorithms provide more accurate predictions.

For more information about building text analytics models in Pega Platform, see Creating text classification analysis models and Creating sentiment analysis models.

Naive Bayes

Naive Bayes is a simple but effective algorithm for predictive modeling that assumes that training features are independent of each other. Even though this assumption does not hold for text data, the classifier can still be very effective. The main advantage of choosing Naive Bayes over the other available algorithms is that it provides the fastest build time for large training data sets. The Naive Bayes algorithm is available for classification analysis in Pega Platform.
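To see the independence assumption in action, the following is a minimal, illustrative sketch of a multinomial Naive Bayes text classifier in plain Python (the training texts and categories are made up for illustration; this is not Pega's implementation). The class score is a product of per-token probabilities, which is exactly the "features are independent" assumption:

```python
# Minimal multinomial Naive Bayes sketch. Each class score is the log
# prior plus a sum of per-token log likelihoods, i.e. tokens are treated
# as conditionally independent given the class.
import math
from collections import Counter, defaultdict

train = [
    ("refund my order please", "billing"),
    ("i want my money back", "billing"),
    ("the app crashes on startup", "technical"),
    ("login screen freezes", "technical"),
]

token_counts = defaultdict(Counter)   # class -> token frequencies
class_counts = Counter()              # class -> number of documents
vocab = set()
for text, label in train:
    tokens = text.split()
    token_counts[label].update(tokens)
    class_counts[label] += 1
    vocab.update(tokens)

def predict(text):
    best_label, best_score = None, -math.inf
    for label in class_counts:
        total = sum(token_counts[label].values())
        # log prior + Laplace-smoothed log likelihood of each token
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for tok in text.split():
            score += math.log((token_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("please refund the money"))  # → billing
```

Because training reduces to counting token frequencies in a single pass, Naive Bayes builds much faster than algorithms that optimize iteratively.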

Maximum Entropy

The Maximum Entropy (MaxEnt) classifier is a probabilistic classifier that belongs to the class of exponential models. Unlike Naive Bayes, MaxEnt does not assume that the features are conditionally independent of each other. Instead, the classifier iterates over the training data multiple times and selects the model with the largest entropy that is consistent with that data. MaxEnt often solves text classification problems more accurately than Naive Bayes; however, a MaxEnt classifier takes more time to build than a Naive Bayes classifier. The MaxEnt algorithm is available for classification and sentiment analysis in Pega Platform.
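A MaxEnt text classifier is mathematically equivalent to multinomial logistic regression, so a sketch using scikit-learn's `LogisticRegression` illustrates the idea (the toy data is invented for illustration; this is not Pega's implementation):

```python
# Illustrative MaxEnt sketch: maximum-entropy classification is
# equivalent to (multinomial) logistic regression over token features.
# Unlike Naive Bayes, the weights are fit jointly by iterative
# optimization, so correlated tokens do not get double-counted.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "refund my order please", "i want my money back",
    "the app crashes on startup", "login screen freezes",
]
train_labels = ["billing", "billing", "technical", "technical"]

# max_iter bounds the iterative passes over the training data
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)
print(model.predict(["login screen freezes"])[0])  # → technical
```

The iterative fitting is what makes MaxEnt slower to build than Naive Bayes on the same data, as the performance table below in this article illustrates.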

Support Vector Machine

Support Vector Machine (SVM) is a classifier that represents training data as points in an n-dimensional space and separates the classes with a hyperplane. SVM is used to build supervised, linear, and nonprobabilistic classifiers. SVM performs best with large amounts of training data; however, classifiers based on SVM are the slowest to build. The SVM algorithm is available for classification analysis in Pega Platform.
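The geometric idea can be sketched with scikit-learn's `LinearSVC` (again a toy example with invented data, not Pega's implementation): each document becomes a point in a high-dimensional vector space, and training finds the separating hyperplane with the maximum margin.

```python
# Illustrative linear SVM sketch: TF-IDF turns each document into a
# point in an n-dimensional space; LinearSVC finds the maximum-margin
# hyperplane between the two classes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "refund my order please", "i want my money back",
    "the app crashes on startup", "login screen freezes",
]
train_labels = ["billing", "billing", "technical", "technical"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)
print(model.predict(["i want my money back"])[0])  # → billing
```

The margin optimization scales poorly with training set size, which is consistent with SVM being the first algorithm to fail as the data sets grow in the measurements below.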

Model performance

The values in the following table were derived by testing Naive Bayes, SVM, and MaxEnt algorithms in Pega Platform against training data of various sizes. The following characteristics were common to all training data:

  • Number of categories in training data – 10
  • Average character count per row – 233
  • Train and test data split ratio – 60%/40%
  • Heap size – 8 gigabytes
You need at least 100 records per category to build a text analytics model. If the classification taxonomy that you use is hierarchical, each leaf node must have at least 100 records.
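Before building, you can verify the 100-records-per-category minimum with a quick check like the following sketch (the record layout, a list of `(text, category)` pairs, is a hypothetical example of your exported training data, not a Pega API):

```python
# Pre-build sanity check: report any category that falls below the
# 100-record minimum required to build a text analytics model.
from collections import Counter

MIN_RECORDS = 100

def underpopulated_categories(rows):
    """Return {category: count} for categories below MIN_RECORDS."""
    counts = Counter(category for _text, category in rows)
    return {cat: n for cat, n in counts.items() if n < MIN_RECORDS}

# Hypothetical training data: 150 billing records, only 40 technical.
rows = [("sample text %d" % i, "billing") for i in range(150)]
rows += [("sample text %d" % i, "technical") for i in range(40)]
print(underpopulated_categories(rows))  # → {'technical': 40}
```

For a hierarchical taxonomy, run the same check per leaf node rather than per top-level category.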
| Training records per category | Total number of rows | File size (megabytes) | Does Naive Bayes build? | Does MaxEnt build? | Does SVM build? | Building time (minutes) | Testing time (minutes) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1,000 | 10,000 | 1 | Yes | Yes | Yes | SVM: 20 | SVM: 5 |
| 10,000 | 100,000 | 13 | Yes | Yes | No | MaxEnt: 6.5; Naive Bayes: 0.84 | MaxEnt: 10; Naive Bayes: 10 |
| 20,000 | 200,000 | 26 | Yes | No | No | Naive Bayes: 35 | Naive Bayes: 22 |
| 20,000 | 200,000 | 26 | No | Yes | No | MaxEnt: 61 | MaxEnt: 34 |
The preceding table shows that you can build models with multiple algorithms simultaneously. However, if the combined training data exceeds a certain size (in these tests, 200,000 rows with an 8-gigabyte heap), some builds fail, and you might have to build each model separately.
