Predictive Models In Marketing And CRM

Predictive models help e-commerce brands achieve deep personalization, a better understanding of their customers, and overall customer-centricity. Statistical methods and artificial intelligence are used to predict future customer behavior and thereby understand each customer better. To take advantage of this, every brand will need to work with its own data scientists, highly specialized agencies, or an automated machine learning platform such as the CrossEngage Customer Prediction Platform (CPP).

So now you have set up your first predictive models. What comes next?

Is Your Predictive Model Good Enough?

Deciding whether your carefully crafted predictive model is finally ready for business can be a daunting task. Many predictive models fail to generate any positive impact, and it is hard to know whether yours will before you deploy it.

There are, however, some signs that there might be trouble ahead. These signs are easy to spot if you know what to look for, so check for them before deploying any predictive model.

But why should you listen to me? In my day job, I am the CAIO (Chief AI Officer) and Co-Founder of CrossEngage. With the CPP, we assist our clients in examining the auto-generated models to help them decide whether these models are useful for their business.

Automated Model Curation

For that reason, our team has examined thousands of predictive models in the last six years. We built a feature in our platform that automatically creates a first sanity check on the generated models. We call this the automated model curation.

While developing this feature, we revisited our shared experience with model metrics and went through hundreds of cases where we knew the business result of the model deployment. Consistent with data science research, we found that there are two things you want to avoid in your predictive model, because models with these kinds of issues tend to collapse when used in the wild:

“False Friends”

Also known as target leakage: data that should not be part of the model training process but ends up in it anyway. This happens when you train your predictive model on a dataset that includes information that would not be available at the time of prediction.
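
To make this concrete, here is a minimal sketch of how to keep future information out of the feature set, assuming a pandas DataFrame of timestamped customer events; the column names and aggregation are illustrative, not the CPP's implementation:

```python
import pandas as pd

def build_features(events: pd.DataFrame, cutoff: pd.Timestamp) -> pd.DataFrame:
    """Aggregate per-customer features using only events recorded before `cutoff`.

    Anything timestamped at or after `cutoff` would not be available at
    prediction time and must therefore stay out of the training features.
    """
    past = events[events["event_time"] < cutoff]
    return past.groupby("customer_id").agg(
        n_purchases=("event_type", lambda s: (s == "purchase").sum()),
        last_seen=("event_time", "max"),
    )
```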

Overfitting

Overfitting describes a model that fits the training data too well. This happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. In terms of weather prediction: your model may be an excellent predictor of yesterday’s weather but completely fails to predict tomorrow’s weather.
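
As an illustration (a synthetic example, not taken from the CPP), an unconstrained decision tree will happily memorize its training data: its training AUC is near-perfect, while its AUC on held-out data drops noticeably:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Synthetic classification data with only a few truly informative features
X, y = make_classification(n_samples=5000, n_features=30, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A fully grown tree memorizes the training set, including its noise
tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)
print("train AUC:", roc_auc_score(y_train, tree.predict_proba(X_train)[:, 1]))  # close to 1.0
print("test AUC: ", roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1]))    # noticeably lower
```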

Detecting “False Friends“ and Overfitting

Thus, we wanted to concentrate on the metrics that could indicate either “False Friends” or overfitting. The criteria we came up with are necessary but not sufficient for a good model. These curation metrics are therefore meant as a warning light: if one of these metrics is in a bad state, it is a strong sign that something is wrong and the model should not be put into production without a very thorough vetting process.

The following are our robustness checks. Please bear in mind that these checks refer to supervised classification models that try to predict whether an event (for example, a purchase) will take place or not.

1. The Number of Positive Cases in the Training Data Set

To have accurate predictions, the number of events (purchases) in the model needs to be large enough to allow sensible inferences. You do not want your model shaped by anecdotal data, which would inevitably lead to overfitting. In our use case, which is predicting future purchase behavior based on an individual customer journey, we found that there need to be at least 1,000 positive cases (purchases) for a model to be valid at all. We only call this metric “good” when there are at least 5,000 positive cases. We also found that there is typically no substantial increase in model quality beyond 15,000 positive cases.
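
A minimal sketch of this first warning light, using the thresholds above (1,000 / 5,000 / 15,000); the function and its messages are illustrative, not part of the CPP:

```python
def check_positive_cases(n_positive: int) -> str:
    """Rate the number of positive cases (purchases) available for training."""
    if n_positive < 1_000:
        return "invalid: too few positive cases, the model will be shaped by anecdotes"
    if n_positive < 5_000:
        return "ok: usable, but more positive cases would help"
    # Beyond roughly 15,000 positive cases we typically see no substantial gain
    return "good: enough positive cases for stable training"
```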

2. The AUC-Value In the Training Data Set

We settled on the very versatile “Area under the Curve” (AUC) metric for capturing the prediction accuracy of the model. It shows the ability of a (classifying) model to distinguish between classes (e.g. purchase or no purchase). The higher the AUC, the better the model is at distinguishing between the positive and negative classes.

The killer feature that made us choose this metric is that it works independently of the underlying distribution of positive/negative cases, because it does not assume a fixed cut-off value. For a good explanation, look here.

We observed that AUC values greater than 0.55, though not great, can deliver at least some value in practice. However, for a model to be good, we found that the AUC value should be greater than 0.70. We also observed that a very high AUC value may not be a good sign. We have a saying in our data science team: “If a model looks too good to be true, then very often it is not.”

So we look critically at all models that have an AUC value greater than 0.92, and we do not use models that have an AUC value greater than 0.99, because that almost always means that there is target leakage.
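
A sketch of this check using scikit-learn’s roc_auc_score and the rules of thumb above (0.55 / 0.70 / 0.92 / 0.99); the verdict wording is illustrative:

```python
from sklearn.metrics import roc_auc_score

def check_training_auc(y_true, y_score) -> str:
    """Flag the training AUC against the rule-of-thumb thresholds above."""
    auc = roc_auc_score(y_true, y_score)  # y_score: predicted purchase probabilities
    if auc > 0.99:
        return f"reject ({auc:.2f}): almost certainly target leakage"
    if auc > 0.92:
        return f"suspicious ({auc:.2f}): looks too good to be true, vet very carefully"
    if auc > 0.70:
        return f"good ({auc:.2f})"
    if auc > 0.55:
        return f"usable ({auc:.2f}): not great, but can still deliver some value"
    return f"poor ({auc:.2f}): barely better than random guessing"
```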


3. The Difference Between the AUC Value in the Training and the Validation Data Set

The absolute AUC value is not the only thing of interest. We observed that models with a large gap between the training AUC and the validation AUC often have an overfitting problem (you do have a validation set, right? Please do).

Overfitting means that the algorithm that built your model did not capture the underlying patterns in your data but just “learned” your training data. This is not a good thing, because in real life it means you cannot make meaningful predictions on new data.

We monitor this by comparing the training AUC value with the validation AUC value. If the validation AUC is significantly smaller, we start to become very skeptical of the model’s prediction capabilities.
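
A sketch of this comparison; since what counts as “significantly smaller” depends on your use case, the maximum gap of 0.05 below is only an assumed value for illustration:

```python
from sklearn.metrics import roc_auc_score

def check_auc_gap(y_train, p_train, y_val, p_val, max_gap: float = 0.05) -> str:
    """Compare training and validation AUC; a large gap hints at overfitting."""
    auc_train = roc_auc_score(y_train, p_train)
    auc_val = roc_auc_score(y_val, p_val)
    if auc_train - auc_val > max_gap:
        return (f"warning: training AUC {auc_train:.2f} vs validation AUC {auc_val:.2f} "
                "suggests the model memorized the training data")
    return f"ok: training AUC {auc_train:.2f}, validation AUC {auc_val:.2f}"
```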

4. The Influence Of the Top Feature In the Model

This metric is the top indicator of having a “False Friend” in your data. If the most influential predictor (feature) of your model has a very high weight in the overall prediction, this is often a sign of target leakage. Note that this need not be the case, as there might just be a very good explanatory variable (feature) in the data set.

Our takeaway from our analysis is that if the weight of the top predictor is greater than 0.70, you should look very carefully at that predictor and examine whether it could be a “False Friend”. Typically, this can be done by looking at the data-generating process of that feature and deciding whether the data for this feature could have been accidentally timestamped with a wrong (earlier) date/time.
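
A sketch of this check on normalized feature importances; the 0.70 threshold is the rule of thumb described above, the rest is illustrative:

```python
import numpy as np

def check_top_feature(importances, feature_names, threshold: float = 0.70) -> str:
    """Flag a suspiciously dominant top predictor as a possible 'False Friend'."""
    weights = np.asarray(importances, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    top = int(np.argmax(weights))
    if weights[top] > threshold:
        return (f"warning: '{feature_names[top]}' carries {weights[top]:.0%} of the weight; "
                "check its data-generating process and timestamps for target leakage")
    return f"ok: top feature '{feature_names[top]}' at {weights[top]:.0%}"
```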

Conclusion

We are very aware that these four metrics are merely the canary in the coal mine. We wanted to show you just the basic indicators that something went wrong in the process of creating a predictive model. These robustness checks are not a green light for putting the model into practice.
Before doing this, you need to ask yourself: “Do the results of this model help my specific business case?” The answer to this question is often much harder to find than evaluating the statistical soundness of the model.

Click here for the original article on Medium.

Dr. Dennis Proppe

About the author: Dr. Dennis Proppe is Chief AI Officer at CrossEngage and responsible for product development and vision execution. He has 15 years of experience in machine learning and has been building AI and engineering teams for ten years. Dennis holds a PhD in Marketing and Statistics from Kiel University.