Best practices for adaptive and predictive model predictors

When you create an adaptive or predictive model, the input fields that you select as predictor data play a crucial role in the predictive performance of that model. Some data types, such as dates and text, might require preprocessing. Follow best practices when you select predictors and choose data types for adaptive and predictive analytics.

You can test your models and various predictor types in the DMSample application. For more information, see Getting started with DMSample.

Predictor selection considerations

A model's predictive power is greatest when you include as much relevant, yet uncorrelated, information as possible. You can make a wide set of candidate predictors available, up to several hundred or more. Both Predictive Analytics Director (PAD) and Adaptive Decision Manager (ADM) automatically select the best subset of predictors. They group related predictors into sets and then select the best predictor from each group, that is, the predictor that has the strongest relationship with the outcome, measured as the area under the curve (AUC). In adaptive decisioning, this predictor selection process repeats periodically.
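The idea of grouping related predictors and keeping the strongest one from each group can be illustrated with a minimal Python sketch. This is not the algorithm that PAD or ADM implements. It assumes numeric predictors, uses Pearson correlation as the measure of relatedness, and applies a hypothetical threshold of 0.8; the function names and sample fields are made up for illustration.

```python
def auc(scores, labels):
    """Rank-based AUC: chance that a random positive scores above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def select_predictors(data, labels, corr_threshold=0.8):
    """Greedy selection: walk predictors from strongest to weakest and keep one
    only if it is not highly correlated with a predictor that is already kept."""
    # Treat an AUC below 0.5 as an inverse relationship of equal strength.
    perf = {name: max(auc(v, labels), 1 - auc(v, labels)) for name, v in data.items()}
    selected = []
    for name in sorted(perf, key=perf.get, reverse=True):
        if all(abs(pearson(data[name], data[s])) < corr_threshold for s in selected):
            selected.append(name)
    return selected, perf


# Hypothetical sample: age_months duplicates the information in age.
data = {
    "age": [25, 32, 47, 51, 38, 29],
    "age_months": [300, 384, 564, 612, 456, 348],
    "calls": [5, 1, 0, 2, 4, 3],
}
labels = [0, 0, 1, 1, 1, 0]
selected, perf = select_predictors(data, labels)
print(selected)  # ['age', 'calls']
```

In this sample, age_months is dropped because it carries the same information as age, while calls is kept because it is only moderately correlated with age.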

To achieve the best results, use predictors that provide data from many different data sources, such as:

  • Customer profile information, for example, age, income, gender, and current product subscriptions. This information is usually part of the Customer Analytic Record (CAR) and is refreshed regularly.
  • Interaction context, such as recent web browsing information, support call reasons, or input that is gathered during a conversation with the customer. This information can be highly relevant and, therefore, very predictive.
  • Customer behavior data, such as product usage or transaction history. The strongest predictors of future behavior typically contain data about past behavior.
  • Scores or other results from the off-line execution of external models.
  • Other customer behavior metrics. 

These data sources are included in a CAR that summarizes customer details, contextual information for the channel, and additional internal data sources, such as Interaction History.

Verify that the predictors in your models accurately predict customer behavior by monitoring their performance on a regular basis. For more information, see Adaptive models monitoring.

You cannot use all data to drive predictions. There are legal and ethical reasons for not using some data, depending on the context (for example, ethnic origin, exact address, or other personal details). Before selecting such data for predictors, check with your organization about which rules apply, and focus on behavioral data that describes what the customer has done, instead of choosing the fields that describe who that person is.

Data types for predictors

Follow these guidelines to gain a basic understanding of how you can use different data types in adaptive and predictive analytics:

Numeric data

You can use basic numeric data, such as age, income, and customer lifetime value (CLV), without any preprocessing. Your model automatically divides that data into relevant value ranges by dynamically defining the bin boundaries. The following example shows uneven bin sizing for a numeric predictor:

Sample numeric data distribution from a Kaggle dataset for marketing
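The binning algorithm that the models use is not described here. As a rough illustration of why bins can have uneven widths, the following Python sketch applies equal-frequency (quantile) binning, which is an assumed stand-in technique and not the product's method. The function names and the income values are hypothetical.

```python
def equal_frequency_bins(values, n_bins):
    """Return upper boundaries so that each bin holds roughly the same number of values."""
    ordered = sorted(values)
    boundaries = []
    for i in range(1, n_bins):
        cut = ordered[i * len(ordered) // n_bins]
        # Skip repeated cut points so that boundaries stay strictly increasing.
        if not boundaries or cut > boundaries[-1]:
            boundaries.append(cut)
    return boundaries


def assign_bin(value, boundaries):
    """Return the index of the first bin whose upper boundary exceeds the value."""
    for index, upper in enumerate(boundaries):
        if value < upper:
            return index
    return len(boundaries)


# Hypothetical, skewed income values (in thousands).
incomes = [12, 15, 18, 20, 22, 25, 40, 90, 150, 400, 1000, 5000]
boundaries = equal_frequency_bins(incomes, n_bins=4)
print(boundaries)  # [20, 40, 400]
```

Because the sample is skewed, the resulting ranges (below 20, 20 to 40, 40 to 400, and 400 or more) differ greatly in width even though each one holds three values.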

Categorical (symbolic) data

You can feed strings with up to 200 distinct values without any preprocessing. Such data is automatically categorized into relevant value groups, as shown in the following example.

Sample categorical data distribution

For strings with more than 200 distinct values, group the data into fewer categories for better model performance. For more information, see Codes.
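One simple way to reduce the number of distinct values is to keep the most frequent categories and merge the rest into a single catch-all value. The following Python sketch assumes that approach; the function name, the "Other" label, and the sample values are hypothetical, and grouping by business meaning (as described in Codes) is often the better choice.

```python
from collections import Counter


def group_rare_categories(values, max_categories=200, other_label="Other"):
    """Keep the most frequent values and map the remaining ones to other_label."""
    counts = Counter(values)
    if len(counts) <= max_categories:
        return list(values)
    # Reserve one slot for the catch-all label.
    keep = {value for value, _ in counts.most_common(max_categories - 1)}
    return [value if value in keep else other_label for value in values]


# A small limit is used here only to keep the example readable.
countries = ["NL", "NL", "US", "US", "US", "DE", "FR", "BE"]
grouped = group_rare_categories(countries, max_categories=3)
print(grouped)  # ['NL', 'NL', 'US', 'US', 'US', 'Other', 'Other', 'Other']
```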

Although one-hot encoding is common in data science, do not implement it for symbolic values. The built-in preprocessing feature in Pega Platform™ efficiently handles symbolics, without any additional preparation.

Customer identifiers

Customer identifiers are symbolic or numeric variables that have a unique value for each customer. Typically, they are not useful as predictors, although they might be predictive in special cases. For example, customer identifiers that are handed out sequentially might be predictive in a churn model, as they correlate to tenure.

Codes

For meaningful numeric fields, feed code fragments to the model as separate predictors. Simple values require only basic transformation. For example, you can shorten postal codes to the first 2 or 3 characters which, in most countries, denote geographical location.

For categorical numeric fields, where numbers do not carry any mathematical meaning, categorize input as symbolic in Prediction Studio. If any background information is available, such as hierarchical code grouping, add new fields that are derived from the code (for example, product versus product family).
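As a minimal Python sketch of both ideas, the following example shortens a postal code to its leading characters and derives a product family from a product code. The lookup table, field names, and product codes are hypothetical; in practice, the hierarchy comes from your own reference data.

```python
# Hypothetical mapping from product code to product family.
PRODUCT_FAMILY = {
    "CC-GOLD": "CreditCard",
    "CC-PLAT": "CreditCard",
    "LN-HOME": "Loan",
}


def postal_prefix(postal_code, length=2):
    """Shorten a postal code to its leading characters, ignoring spaces and case."""
    return postal_code.replace(" ", "").upper()[:length]


def code_predictors(postal_code, product_code):
    """Derive separate predictor fields from raw code values."""
    return {
        "PostalArea": postal_prefix(postal_code),
        "Product": product_code,
        "ProductFamily": PRODUCT_FAMILY.get(product_code, "Unknown"),
    }


fields = code_predictors("1016 ab", "CC-GOLD")
print(fields)
```

The derived fields are strings, so they are candidates for the symbolic predictor type rather than the numeric one.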

Dates

If you use dates without any preprocessing, predictive and adaptive models categorize them as numeric data (absolute date/time value). The meaning of such input values changes over time; in other words, the same date might carry a different meaning, depending on the time of reference. For example, July 4th means "recently" when you run a model on July 5th, but it means "in the past few months" when you perform the analysis on December 6th.

Because of that ambiguity, avoid using absolute date/time values as predictors. Instead, take the time span until now (for example, derive age from the DateOfBirth field), or the time difference between various pairs of dates in your data fields. Additionally, you can improve predictor performance by extracting fields that denote a specific time of day, week, or month.
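The following Python sketch illustrates these derivations under assumed field names (DateOfBirth is from the text above; the purchase timestamps and output names are hypothetical). It converts absolute dates into an age, two durations, and calendar parts.

```python
from datetime import date, datetime


def date_predictors(date_of_birth, first_purchase, last_purchase, now):
    """Turn absolute dates into relative and cyclical predictor values."""
    # Subtract one year if this year's birthday has not happened yet.
    birthday_pending = (now.month, now.day) < (date_of_birth.month, date_of_birth.day)
    return {
        "Age": now.year - date_of_birth.year - birthday_pending,
        "DaysSinceLastPurchase": (now.date() - last_purchase.date()).days,
        "DaysBetweenPurchases": (last_purchase.date() - first_purchase.date()).days,
        "HourOfDay": last_purchase.hour,
        "DayOfWeek": last_purchase.weekday(),  # Monday is 0
        "DayOfMonth": last_purchase.day,
    }


predictors = date_predictors(
    date_of_birth=date(1985, 7, 4),
    first_purchase=datetime(2019, 1, 15, 9, 30),
    last_purchase=datetime(2019, 4, 10, 18, 5),
    now=datetime(2019, 4, 17, 12, 0),
)
print(predictors)
```

Passing the reference time as a parameter, instead of reading the system clock inside the function, keeps the derived values reproducible when you rerun an analysis later.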

The following examples show date and time values used as predictors. 

Sample predictors derived from a date (actual example from a Kaggle data mining competition)
Sample predictors extracted from a date/time stamp (duration)

Text

Do not use plain text to create predictors without any preprocessing, because it contains too many unique values. Instead, run a Text Analyzer rule on your text input to extract such fields as the intent, topic, and sentiment, to use them as predictors.

A Text Analyzer rule puts its output in a property of class Data-NLP-Outcome. You can use elements from this class as model input, as strategy properties, and so on. Some of the frequently used properties include the following:

Property | Description
<Output Field>.pyTopics(1).pyName | The first category (topic) in a set of multiple topics. A related .pyConfidenceScore property assigns the confidence level for rules that are based on models.
<Output Field>.pyIntents(1).pyName | The first intent in a set of multiple intents.
<Output Field>.OverallSentiment | An overall sentiment that is mapped to a string (for example, positive, neutral, or negative).
<Output Field>.OverallSentimentValue | An overall sentiment as a numeric value (in the range of -1 to +1).

Event streams

Do not use event streams as predictors without preprocessing, but extract the data in an event strategy instead. Store the aggregations in a Decision Data Store (DDS) data set that is typically keyed by Customer ID, as shown in the following example:

High-frequency events aggregation

In the decision process, this data set is joined with the rest of the customer data, and the aggregates are treated like any other symbolic or numeric field, as shown in the following example:

Merging aggregated events with customer data
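In Pega Platform, the aggregation happens in an event strategy and the result is stored in a DDS data set. The following Python sketch only illustrates the data flow of aggregating events per customer and then joining the aggregates with customer data. The event types, field names, and sample records are hypothetical.

```python
from collections import defaultdict


def aggregate_events(events):
    """Aggregate (customer_id, event_type, value) tuples into per-customer summaries."""
    aggregates = defaultdict(
        lambda: {"EventCount": 0, "DataUsageTotal": 0.0, "DroppedCalls": 0}
    )
    for customer_id, event_type, value in events:
        summary = aggregates[customer_id]
        summary["EventCount"] += 1
        if event_type == "DataUsage":
            summary["DataUsageTotal"] += value
        elif event_type == "DroppedCall":
            summary["DroppedCalls"] += 1
    return dict(aggregates)


def join_with_customers(customers, aggregates):
    """Add the aggregates to each customer record, using zeros when no events exist."""
    empty = {"EventCount": 0, "DataUsageTotal": 0.0, "DroppedCalls": 0}
    return {
        customer_id: {**profile, **aggregates.get(customer_id, empty)}
        for customer_id, profile in customers.items()
    }


events = [
    ("C1", "DroppedCall", 0.0),
    ("C1", "DataUsage", 1.5),
    ("C1", "DataUsage", 2.5),
    ("C2", "DataUsage", 0.5),
]
customers = {"C1": {"Age": 40}, "C2": {"Age": 29}, "C3": {"Age": 55}}
records = join_with_customers(customers, aggregate_events(events))
print(records["C1"])
```

After the join, each aggregate is an ordinary numeric field on the customer record, and customers without events receive zero values rather than missing fields.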

Interaction History

Past interactions are usually very predictive. You can use Interaction History (IH) to extract such fields as the number of recent purchases, the time since last purchase, and so on. To summarize and preprocess IH to use that data in predictions, use IH summaries. A list of predictors based on IH summaries is enabled by default, without any additional setup, for all new adaptive models. 

For more information about using IH data in adaptive analytics, see Using Interaction History to drive predictions and Add predictors based on Interaction History.

Multidimensional data

For models that make your primary decision for a customer, use lists of products, activities, and so on, as the source of useful information for predictors. Create fields from that data either through Pega Platform™ expressions that operate on these lists or through substrategies that work on this embedded data, and then complete aggregations in strategies. Regardless of your choice, use your intuition and data science insight to determine the possibly relevant derivatives, for example, number-of-products, average-sentiment-last-30-days, and so on.
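In Pega Platform, you build these derivatives with expressions or substrategies. As a language-neutral illustration, the following Python sketch computes the two derivatives that are named above from embedded lists. The record layout and field names are hypothetical.

```python
from datetime import datetime, timedelta


def list_predictors(products, interactions, now):
    """Summarize embedded lists into single-value predictor fields."""
    cutoff = now - timedelta(days=30)
    recent = [item["sentiment"] for item in interactions if item["timestamp"] >= cutoff]
    return {
        "NumberOfProducts": len(products),
        # None marks a missing value when there are no recent interactions.
        "AverageSentimentLast30Days": sum(recent) / len(recent) if recent else None,
    }


summary = list_predictors(
    products=["Checking", "CreditCard", "Mortgage"],
    interactions=[
        {"timestamp": datetime(2019, 4, 10), "sentiment": 0.5},
        {"timestamp": datetime(2019, 4, 1), "sentiment": 0.25},
        {"timestamp": datetime(2019, 1, 1), "sentiment": 1.0},
    ],
    now=datetime(2019, 4, 17),
)
print(summary)  # {'NumberOfProducts': 3, 'AverageSentimentLast30Days': 0.375}
```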

Pega internal data

For predictions in the context of a Pega application, Pega internal data might be useful to add as predictors, in addition to external, non-Pega customer data.

Real-time contextual data

To increase the efficiency and performance of your models, do not limit the personalization of your decisions and predictions to the customer alone. By supplementing the decision process data with the interaction context, you can adjust the predictions for a customer and provide different outcomes depending on the context. The changing circumstances might include the reason for a call, the particular part of the website or mobile app where the customer operates, the current Interactive Voice Response (IVR) menu, and so on.

Customer behavior and usage

Customer behavior and interactions, such as financial transactions, claims, calls, complaints, and flights, are typically transactional in nature. From the predictive analytics perspective, you can use that data to create derived fields that summarize or aggregate this data for better predictions, for example, by adopting the Recency, Frequency, Monetary (RFM) approach.

For example, use RFM to track the latest call of a certain type, the frequency of calls in general, and their duration or monetary value. You can perform that search across different time periods, and potentially transform or combine some of that data to extract detailed statistics, such as the average length of a call, the average gigabyte usage last month, an increase or decrease in usage over the last month compared to previous months, and so on.
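A minimal Python sketch of the RFM idea follows, assuming that transactions are available as (date, amount) pairs and that a 90-day window is appropriate; both the window and the field names are hypothetical choices.

```python
from datetime import date


def rfm(transactions, today, window_days=90):
    """Compute recency, frequency, and monetary value over a trailing window."""
    in_window = [
        (day, amount)
        for day, amount in transactions
        if 0 <= (today - day).days <= window_days
    ]
    if not in_window:
        return {"RecencyDays": None, "Frequency": 0, "Monetary": 0.0}
    return {
        "RecencyDays": min((today - day).days for day, _ in in_window),
        "Frequency": len(in_window),
        "Monetary": sum(amount for _, amount in in_window),
    }


transactions = [
    (date(2019, 4, 10), 20.0),
    (date(2019, 3, 1), 30.0),
    (date(2018, 12, 1), 100.0),  # outside the 90-day window
]
result = rfm(transactions, today=date(2019, 4, 17))
print(result)  # {'RecencyDays': 7, 'Frequency': 2, 'Monetary': 50.0}
```

Calling the function with several window lengths produces the per-period values that you can compare, for example, to detect an increase or decrease in usage.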

For more information, see Event streams.

Model scores and other data science output

Scores from predictive models for different but related outcomes and other data science output might be predictive as well. Common data science output types that are useful as predictors include:

  • Classifications
  • Segmentations and clusters
  • Embeddings
  • Dimensionality reduction scores (PCA)

A typical application of data science output in analytics is the use of a higher-level product propensity score for a large number of adaptive models that are related to the same product. For example, you can apply a single propensity to buy or use a credit card score to all the adaptive models that are related to credit card proposition channel variants.

You can calculate these scores through PAD and PMML models, or you can make them available from a large database.

If you decide to use scores as predictors in your models, evaluate whether the models that include such a score perform better at the model level, by verifying the area under the curve (AUC) and the success rate metrics.

Published July 17, 2018 — Updated April 17, 2019

