Training data size considerations for building text analytics models
Depending on the algorithm that you use, the size of the training data for a text analytics model can affect the build time. For example, a model that is trained on a very large data set (10,000 to 20,000 records per category) can take more than one hour to build.
Pega Platform® provides a set of algorithms that you can use to train your classifier for sentiment and classification analysis. Build times vary by algorithm. For example, Naive Bayes builds the fastest on large training data sets. However, other algorithms provide more accurate predictions.
Naive Bayes
Naive Bayes is a simple but effective algorithm for predictive modeling that assumes that training features are conditionally independent of each other, given the category. Even though this assumption is incorrect for text data, this classifier can be very effective. The main advantage of choosing Naive Bayes over the other available algorithms is that it provides the fastest build time for large training data sets. The Naive Bayes algorithm is available for classification analysis in Pega Platform.
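To make the independence assumption concrete, the following is a minimal sketch of a multinomial Naive Bayes text classifier in plain Python. It is not the Pega Platform implementation; the function names, the whitespace tokenization, the add-one smoothing, and the sample records are all assumptions chosen for illustration. The independence assumption appears as the loop that adds one log-probability per word without considering any other word.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """Count documents per label and words per label in a single pass."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-posterior score."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Each word contributes independently (add-one smoothing).
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training records, not taken from the tests described here.
training = [
    ("refund my order please", "billing"),
    ("charge on my card is wrong", "billing"),
    ("app crashes on login", "technical"),
    ("cannot login to the app", "technical"),
]
model = train_naive_bayes(training)
print(predict(model, "wrong charge on card"))  # billing
```

Training needs only one counting pass over the data, which is consistent with Naive Bayes having the shortest build time.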
Maximum Entropy
The Maximum Entropy (MaxEnt) classifier is a probabilistic classifier that belongs to the class of exponential models. Unlike Naive Bayes, MaxEnt does not assume that the features are conditionally independent of each other. Instead, the classifier iterates multiple times over the training data and selects, from the models that are consistent with that data, the one that has the largest entropy. This classifier can solve various text classification problems better than Naive Bayes. However, a MaxEnt classifier takes more time to build than a Naive Bayes classifier. The MaxEnt algorithm is available for classification and sentiment analysis in Pega Platform.
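The iterative nature of MaxEnt training can be sketched as follows. A MaxEnt classifier over word features is equivalent to multinomial logistic regression, so this example fits one with stochastic gradient ascent on the log-likelihood. It is an illustrative sketch only: the function names, the learning rate, the number of passes, and the sample records are assumptions and do not describe how Pega Platform trains its models.

```python
import math

def featurize(text):
    """Binary bag-of-words features (an assumption for this sketch)."""
    return set(text.lower().split())

def predict_proba(weights, feats):
    """Softmax over per-label sums of feature weights."""
    scores = {label: sum(w.get(f, 0.0) for f in feats)
              for label, w in weights.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    z = sum(exps.values())
    return {label: e / z for label, e in exps.items()}

def train_maxent(docs, labels, epochs=200, lr=0.5):
    """Repeatedly pass over the data, nudging weights toward the gold label."""
    weights = {label: {} for label in labels}
    for _ in range(epochs):
        for text, gold in docs:
            feats = featurize(text)
            probs = predict_proba(weights, feats)
            for label in labels:
                # Log-likelihood gradient: observed count minus expected count.
                error = (1.0 if label == gold else 0.0) - probs[label]
                for f in feats:
                    weights[label][f] = weights[label].get(f, 0.0) + lr * error
    return weights

def classify(weights, text):
    probs = predict_proba(weights, featurize(text))
    return max(probs, key=probs.get)

# Hypothetical training records, not taken from the tests described here.
training = [
    ("refund my order please", "billing"),
    ("charge on my card is wrong", "billing"),
    ("app crashes on login", "technical"),
    ("cannot login to the app", "technical"),
]
weights = train_maxent(training, ["billing", "technical"])
print(classify(weights, "wrong charge on card"))  # billing
```

Unlike the single counting pass of Naive Bayes, every epoch here revisits all records, which is why build time grows faster with the size of the training data.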
Support Vector Machine
Support Vector Machine (SVM) is a classifier that represents training data as points in an n-dimensional space and separates the categories with a hyperplane. SVM is used to build supervised, linear, and nonprobabilistic classifiers. SVM performs best with large amounts of training data; however, classifiers based on SVM are the slowest to build. The SVM algorithm is available for classification analysis in Pega Platform.
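As a rough illustration of the separating hyperplane, the following sketch trains a linear SVM on two-dimensional points by using subgradient descent on the regularized hinge loss. This is one common way to fit a linear SVM and is not necessarily the method that Pega Platform uses; the function names, hyperparameters, and data points are hypothetical. Labels are encoded as +1 and -1, and the learned hyperplane is the set of points where w·x + b = 0.

```python
def train_linear_svm(points, labels, epochs=200, lr=0.1, reg=0.01):
    """Fit w and b by subgradient descent on the regularized hinge loss."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):  # y is +1 or -1
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point is misclassified or inside the margin: move toward it.
                w = [wi + lr * (y * xi - reg * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Point is safely classified: apply only the regularization.
                w = [wi - lr * reg * wi for wi in w]
    return w, b

def side_of_hyperplane(w, b, x):
    """Return +1 or -1 depending on which side of w.x + b = 0 the point lies."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical, linearly separable points in two dimensions.
points = [(-2, -1), (-1, -2), (-1, -1), (1, 1), (2, 1), (1, 2)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(points, labels)
print(side_of_hyperplane(w, b, (-3, -3)), side_of_hyperplane(w, b, (3, 3)))
```

The classifier is nonprobabilistic in the sense described above: it reports only the side of the hyperplane, not a probability for each category.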
The values in the following table were derived by testing Naive Bayes, SVM, and MaxEnt algorithms in Pega Platform against training data of various sizes. The following characteristics were common to all training data:
- Number of categories in training data – 10
- Average character count per row – 233
- Train and test data split ratio – 60%/40%
- Heap size – 8 gigabytes
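The row counts that follow from these characteristics can be checked with a few lines of arithmetic. The sketch below uses a hypothetical helper, `dataset_shape`, and assumes that the 60%/40% split applies to the total number of rows.

```python
def dataset_shape(records_per_category, categories=10, train_percent=60):
    """Return (total rows, training rows, testing rows) for one test setup."""
    total = records_per_category * categories
    train = total * train_percent // 100
    return total, train, total - train

for n in (1_000, 20_000):
    total, train, test = dataset_shape(n)
    print(f"{n:,} per category: {total:,} rows, {train:,} train, {test:,} test")
# 1,000 per category: 10,000 rows, 6,000 train, 4,000 test
# 20,000 per category: 200,000 rows, 120,000 train, 80,000 test
```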
|Training records per category||Total number of rows||File size (megabytes)||Does Naive Bayes build?||Does MaxEnt build?||Does SVM build?||Building time (minutes)||Testing time (minutes)|
|1,000||10,000||1||Yes||Yes||Yes||SVM: 20; Naive Bayes: 0.84||SVM: 5; Naive Bayes: 10|
|20,000||200,000||26||Yes||No||No||Naive Bayes: 35||Naive Bayes: 22|
|20,000||200,000||26||No||Yes||No||MaxEnt: 61||MaxEnt: 34|
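The preceding table shows that you can build a model with multiple algorithms simultaneously. However, if the training data is too large for the combined build, the build might fail. In these tests, all three algorithms built together at 1,000 records per category, whereas at 20,000 records per category (200,000 rows, 26 megabytes) the table records only single-algorithm builds, for Naive Bayes and for MaxEnt.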
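To compare the algorithms on the largest data set, the reported build times can be converted to throughput. The sketch below uses only the 200,000-row figures from the table (Naive Bayes: 35 minutes, MaxEnt: 61 minutes); the helper name `rows_per_minute` is hypothetical, and the results apply only to the tested configuration.

```python
def rows_per_minute(rows, minutes):
    """Average number of training rows processed per minute of build time."""
    return rows / minutes

ROWS = 200_000
build_minutes = {"Naive Bayes": 35, "MaxEnt": 61}  # values from the table

for name, minutes in build_minutes.items():
    print(f"{name}: {rows_per_minute(ROWS, minutes):,.0f} rows per minute")
# Naive Bayes: 5,714 rows per minute
# MaxEnt: 3,279 rows per minute

ratio = build_minutes["MaxEnt"] / build_minutes["Naive Bayes"]
print(f"MaxEnt takes {ratio:.2f} times as long as Naive Bayes")
# MaxEnt takes 1.74 times as long as Naive Bayes
```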