Training data size considerations for building text analytics models

Build your text analytics models efficiently by choosing an optimal algorithm for your training data size. Consider the building times and prediction accuracies that different types of algorithms available in Pega Platform can provide.

Depending on the type of algorithm that you use, the size of training data for a text analytics model can affect the build time. For example, a model that is fed with a very large training data set (such as 10,000 to 20,000 records per category) can take more than one hour to generate.

Algorithms

Pega Platform provides a set of algorithms that you can use to train your classifier for sentiment and classification analysis. Depending on the algorithm that you use, the building times might vary. For example, Naive Bayes performs the fastest analysis of training data sets. However, other algorithms provide more accurate predictions.

Naive Bayes
Naive Bayes is a simple but effective algorithm for predictive modeling that assumes that training features are independent of each other. Even though this assumption does not hold for text data, the classifier can still be very effective. The main advantage of choosing Naive Bayes over the other available algorithms is that it provides the fastest build time for large training data sets. The Naive Bayes algorithm is available for classification analysis in Pega Platform.
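The idea can be illustrated with a minimal multinomial Naive Bayes classifier over bag-of-words counts. This is an illustrative sketch only, not Pega's implementation; the category names and documents are invented examples.

```python
from collections import Counter, defaultdict
import math

# Minimal multinomial Naive Bayes for text classification.
# Assumes bag-of-words features; invented example data, not Pega's implementation.
class NaiveBayes:
    def fit(self, docs, labels):
        self.priors = Counter(labels)              # documents per category
        self.word_counts = defaultdict(Counter)    # per-category word counts
        self.vocab = set()
        for doc, label in zip(docs, labels):
            for word in doc.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        self.total = len(labels)

    def predict(self, doc):
        best_label, best_score = None, float("-inf")
        for label in self.priors:
            # log prior + sum of per-word log likelihoods:
            # summing per word is exactly the independence assumption
            score = math.log(self.priors[label] / self.total)
            n = sum(self.word_counts[label].values())
            v = len(self.vocab)
            for word in doc.lower().split():
                # Laplace smoothing avoids zero probability for unseen words
                score += math.log((self.word_counts[label][word] + 1) / (n + v))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes()
nb.fit(["great service fast reply", "love the new app",
        "cannot log in error", "app crashes on start"],
       ["praise", "praise", "complaint", "complaint"])
print(nb.predict("error on login"))  # -> complaint
```

Because training reduces to counting words per category in a single pass, build time scales linearly with the data, which is why Naive Bayes builds fastest on large training sets.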
Maximum Entropy
The Maximum Entropy (MaxEnt) classifier is a probabilistic classifier that belongs to the family of exponential models. Unlike Naive Bayes, MaxEnt does not assume that the features are conditionally independent of each other. Instead, the classifier iterates multiple times over the training data and selects the model with the maximum entropy that is consistent with that data. This classifier can solve many text classification problems more accurately than Naive Bayes. However, a MaxEnt classifier takes more time to build than a Naive Bayes classifier. The MaxEnt algorithm is available for classification and sentiment analysis in Pega Platform.
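The iterative training that makes MaxEnt slower than Naive Bayes can be sketched with a binary logistic regression (the two-class case of MaxEnt) fitted by gradient ascent. This is a hedged illustration with invented features and data, not Pega's implementation.

```python
import math

# Minimal binary maximum-entropy (logistic regression) classifier.
# Illustrative sketch with invented data, not Pega's implementation.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_maxent(X, y, lr=0.5, epochs=200):
    w = [0.0] * len(X[0])
    for _ in range(epochs):            # iterate multiple times over the data
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            # gradient step; features are NOT assumed independent,
            # the weights are fitted jointly
            w = [wj + lr * (yi - p) * xj for wj, xj in zip(w, xi)]
    return w

# Two invented bag-of-words features per document: [count("good"), count("bad")]
X = [[2, 0], [1, 0], [0, 2], [0, 1]]
y = [1, 1, 0, 0]          # 1 = positive sentiment, 0 = negative
w = train_maxent(X, y)
p = sigmoid(sum(wj * xj for wj, xj in zip(w, [3, 0])))
print(p)                  # probability of positive sentiment for a "good"-heavy doc
```

The repeated passes over the training data in the loop above are the reason MaxEnt build times grow faster than Naive Bayes build times as the data set gets larger.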
Support Vector Machine
Support Vector Machine (SVM) is a classifier that represents training data as points in an n-dimensional space that is separated by a hyperplane. SVM is used to build supervised, linear, non-probabilistic classifiers. SVM performs best with large amounts of training data; however, classifiers based on SVM are the slowest to build. The SVM algorithm is available for classification analysis in Pega Platform.
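The separating-hyperplane idea can be sketched with a tiny linear SVM trained on hinge loss with the Pegasos-style sub-gradient method. The 2-D data and labels below are invented for illustration; this is not Pega's implementation.

```python
import random

# Minimal linear SVM (Pegasos-style sub-gradient descent on hinge loss).
# Invented 2-D example data; an illustrative sketch, not Pega's implementation.
def train_svm(X, y, lam=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):   # shuffled passes
            t += 1
            eta = 1.0 / (lam * t)                     # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            if margin < 1:
                # margin violation: shrink w and step toward the point
                w = [(1 - eta * lam) * wj + eta * y[i] * xj
                     for wj, xj in zip(w, X[i])]
            else:
                # correctly classified with margin: only shrink (regularize)
                w = [(1 - eta * lam) * wj for wj in w]
    return w

X = [[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w = train_svm(X, y)
# the side of the hyperplane w.x = 0 determines the predicted class
pred = 1 if sum(wj * xj for wj, xj in zip(w, [1.0, 1.0])) >= 0 else -1
print(pred)
```

The training loop repeatedly revisits every point to position the hyperplane, which hints at why SVM classifiers are the slowest of the three to build on large data sets.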

For more information about building text analytics models in Pega Platform, see Creating machine learning topic models and Determining the emotional tone of text.

Model performance

The values in the following table were derived by testing Naive Bayes, SVM, and MaxEnt algorithms in Pega Platform against training data of various sizes. The following characteristics were common to all training data:

  • Number of categories in training data – 10
  • Average character count per row – 233
  • Train and test data split ratio – 60%/40%
  • Heap size – 8 gigabytes
You need at least 100 records per category to build a text analytics model. If the classification taxonomy that you use is hierarchical, each leaf node must have at least 100 records.
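Before starting a long-running build, it can be worth checking the 100-record minimum up front. The following is a hypothetical pre-flight check; the helper name and category labels are invented for illustration.

```python
from collections import Counter

# Hypothetical pre-flight check for the 100-records-per-category minimum.
# Function name and labels are invented examples.
MIN_RECORDS_PER_CATEGORY = 100

def undersized_categories(labels, minimum=MIN_RECORDS_PER_CATEGORY):
    """Return categories (or leaf nodes) that fall below the minimum."""
    counts = Counter(labels)
    return {cat: n for cat, n in counts.items() if n < minimum}

labels = ["billing"] * 150 + ["shipping"] * 40
print(undersized_categories(labels))  # -> {'shipping': 40}
```

For a hierarchical taxonomy, the same check would be applied per leaf node rather than per top-level category.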

You can use multiple algorithms simultaneously, as shown in the following table. However, if the combined training data grows too large, the build might fail; in the tests below, Naive Bayes and MaxEnt each built a 200,000-row model only when run separately.

Performance results of Naive Bayes, SVM and MaxEnt algorithms

| Training records per category | Total number of rows | File size (MB) | Does Naive Bayes build? | Does MaxEnt build? | Does SVM build? | Building time (minutes) | Testing time (minutes) |
|---|---|---|---|---|---|---|---|
| 1,000 | 10,000 | 1 | Yes | Yes | Yes | SVM: 20 | SVM: 5 |
| 10,000 | 100,000 | 13 | Yes | Yes | No | MaxEnt: 6.5; Naive Bayes: 0.84 | MaxEnt: 10; Naive Bayes: 10 |
| 20,000 | 200,000 | 26 | Yes | No | No | Naive Bayes: 35 | Naive Bayes: 22 |
| 20,000 | 200,000 | 26 | No | Yes | No | MaxEnt: 61 | MaxEnt: 34 |
