In the first blog post about A/B testing, you already learned what an A/B test is good for and the concept behind it. You were also introduced to two simple test designs.

In this second part, you will learn about two more advanced test designs. You will also get concrete tips for executing and evaluating A/B tests, summarized in a practical checklist. Enjoy reading!

Uplift Comparison Test:
Comparison of the Uplift of Two Selections For the Same Customer Group

Uplift is the additional impact of a campaign compared to the previous status quo. To compare the uplift generated by different selection procedures against “doing nothing”, you should additionally include a control group – a so-called “zero group” – within each selection procedure.

Uplift Comparison Test

The procedure starts in the same way as the simple A/B test: the total potential is randomly divided into two parts according to a distribution key:

  • On one part of the customer base, the previous selection procedure is used.
  • The other part is selected via the CPP.

For each selection procedure, a contact pool of top customers should be formed. These top customer pools contain the customers rated as good according to the respective selection procedure. The number of contacts per top customer pool depends on the previously defined distribution key.

Out of these pools, the customers are randomly distributed either to the test group or to the zero group. Each top customer pool must therefore be as large as its respective test and zero groups combined. The test groups are included in the total circulation; the zero groups are not advertised. In this way, the uplift of the test group compared to the zero group can be calculated for each selection (in relative or absolute numbers). The selection procedures can then also be compared with regard to their individual uplift.
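As an illustration, here is a minimal Python sketch of this uplift calculation, assuming you already know the number of advertised customers and buyers in each test group and its zero group (all figures below are hypothetical):

```python
def uplift(test_buyers, test_size, zero_buyers, zero_size):
    """Uplift of a test group versus its zero group, absolute and relative."""
    test_cr = test_buyers / test_size    # conversion rate of the advertised test group
    zero_cr = zero_buyers / zero_size    # conversion rate of the non-advertised zero group
    absolute = test_cr - zero_cr         # uplift in percentage points
    relative = absolute / zero_cr        # uplift relative to "doing nothing"
    return absolute, relative

# Hypothetical figures: previous selection procedure vs. CPP selection
prev_abs, prev_rel = uplift(test_buyers=450, test_size=10_000, zero_buyers=380, zero_size=10_000)
cpp_abs,  cpp_rel  = uplift(test_buyers=520, test_size=10_000, zero_buyers=385, zero_size=10_000)

print(f"Previous selection: {prev_abs:+.2%} absolute, {prev_rel:+.1%} relative uplift")
print(f"CPP selection:      {cpp_abs:+.2%} absolute, {cpp_rel:+.1%} relative uplift")
```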

Uplift Comparison Tests For Different Customer Groups

If you want to compare customer selection procedures that were applied to different customer groups, you can use the same test design as described above for the uplift comparison test. Such a case can arise if, for example, churn prevention campaigns (existing customers who deviate from their usual purchasing behavior) are to be compared with classic reactivation campaigns (inactive customers).

This test design makes sense for different groups because KPIs such as contact value or conversion rate are not very meaningful when the addressed customer groups differ. In the case described, a generally higher conversion rate can be expected for churn prevention than for classic reactivation: customers who are still active but show churn potential will almost certainly perform better than customers who are already inactive. A direct comparison of the raw KPIs could therefore lead to false conclusions.

Uplift Comparison Tests For Different Customer Groups

Carrying Out an A/B Test

In addition to selecting the right test design, you should decide how many customers to allocate to the test groups and control groups – the distribution key. In the case of the uplift test, the same applies to the split between a test group and its respective zero group. The groups do not have to be of equal size; a 70/30 or 80/20 split, for example, is also possible. However, both groups must be large enough to produce statistically significant results.

The Sample Size Calculator can be used to calculate the minimum size of the control or zero group (the test group is usually at least as large). To use the Sample Size Calculator, the following information must be available:

  • The baseline conversion rate of the already existing alternative
  • The minimum effect that should be measurable in the test

The baseline conversion rate is typically very easy to determine from past campaigns – it is the average conversion rate of the most recent (similar) campaigns.

It is much more difficult to correctly assess the minimum effect of a new measure. As a general rule, the smaller the effect you want to detect, the larger the groups need to be. There is no universally applicable guideline value; it depends on how high you set the minimum effect you want to measure and how many contacts you have available. What is reasonable here is largely a matter of experience and intuition.

The result can be read as follows: if you want to measure an additional effect of at least x% in an A/B test (regardless of the test design), you need a total circulation of y.
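If you want to reproduce such a calculation yourself rather than using an online Sample Size Calculator, the following Python sketch uses the common normal-approximation formula for comparing two proportions; the baseline conversion rate, minimum effect, and power in the example are hypothetical choices:

```python
from scipy.stats import norm

def min_group_size(baseline_cr, min_effect, alpha=0.05, power=0.8):
    """Minimum customers per group for a two-sided test of two proportions
    (normal approximation), given a baseline conversion rate and the smallest
    uplift (in percentage points) that should still be detectable."""
    p1 = baseline_cr
    p2 = baseline_cr + min_effect
    z_alpha = norm.ppf(1 - alpha / 2)   # depends on the chosen confidence level
    z_beta = norm.ppf(power)            # depends on the desired statistical power
    n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
    return int(n) + 1

# Hypothetical example: 3% baseline conversion rate, detect at least +0.5 percentage points
print(min_group_size(baseline_cr=0.03, min_effect=0.005))   # roughly 20,000 per group
```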

Furthermore, a randomized division of the potential must be ensured. This can be done, for example, using a command in Excel (“RAND()”), in SQL (“RAND() <= 0.5”), or with other methods in your CRM, DWH, etc.
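As a minimal Python sketch of such a random split, assuming an 80/20 distribution key and a pandas DataFrame of customers (the column and variable names are purely illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical customer base; in practice this would come from your CRM or DWH
customers = pd.DataFrame({"customer_id": range(1, 10_001)})

rng = np.random.default_rng(seed=42)          # fixed seed makes the split reproducible
customers["group"] = np.where(
    rng.random(len(customers)) <= 0.8,        # 80/20 distribution key
    "test",
    "control",
)

print(customers["group"].value_counts())
```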

Evaluating A/B Tests

To evaluate an A/B test, the KPIs (such as contact value or conversion rate) must first be determined for all selection procedures so that the groups can be compared with each other. Secondly, to make sure that the differences between the results of the groups are significant (and not random variation), it is recommended to perform a statistical significance test. A significance test examines whether the observed difference between the respective control and test groups is large enough that it is unlikely to have occurred by chance – in which case the effect of the tested measure is statistically significant.

For conversion rates, a chi-square test (a statistical significance test) can answer the question of whether the differences found are relevant. To perform the test, you have to specify a confidence level. This results in a confidence interval, a statistically calculated range of values that can be used to assess whether the differences in the results are actually statistically relevant or merely the result of chance.

To easily perform a chi-square test, we recommend this link. There, for both the test and the control group, you need the number of advertised customers and the number of customers from that group who actually made a purchase during the test period. In addition, a confidence level must be specified, which is usually 90% or 95% for marketing questions.
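If you prefer to perform the test yourself rather than via an online tool, a chi-square test can be run, for example, with scipy; the purchase figures below are hypothetical:

```python
from scipy.stats import chi2_contingency

# Hypothetical results per group during the test period: [buyers, non-buyers]
test_group    = [520, 9_480]   # 10,000 advertised customers, 520 purchases
control_group = [450, 9_550]   # 10,000 advertised customers, 450 purchases

chi2, p_value, dof, expected = chi2_contingency([test_group, control_group])

confidence_level = 0.95
if p_value < 1 - confidence_level:
    print(f"Difference is statistically significant (p = {p_value:.4f})")
else:
    print(f"Difference is not statistically significant (p = {p_value:.4f})")
```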

Confidence intervals are constructed in such a way that they contain “the true value” with a probability corresponding to the confidence level. For example, with a confidence level of 95% and a confidence interval of [8.5% – 22.1%], the actual conversion rate lies between 8.5% and 22.1% with a 95% probability.
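Such a confidence interval for a single group's conversion rate can, for example, be computed with statsmodels; the figures below are again hypothetical:

```python
from statsmodels.stats.proportion import proportion_confint

# Hypothetical test group: 130 purchases out of 850 advertised customers
low, high = proportion_confint(count=130, nobs=850, alpha=0.05, method="wilson")
print(f"95% confidence interval for the conversion rate: [{low:.1%} – {high:.1%}]")
```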

More detailed information on statistical significance tests in marketing can be found here.

Checklist For Implementing A/B Tests

  1. Define the total customer base eligible for the customer selection methods (customer group that should receive the tested measure, for example existing customers or former customers).
  2. Define the campaign (hypothesis) to be tested (for example, a print mailing for which customers are selected via a predictive model from the CPP vs. the standard selection method).
  3. Align on the KPIs to be evaluated later (for example, revenue per contact).
  4. Choose a reasonable test design (see alternatives described above).
  5. Randomly divide your total customer base into a test group and a control group according to a predefined distribution key.
  6. Carry out the A/B test according to the selected test design.
  7. If necessary, define zero group and control group sizes using the Sample Size Calculator.
  8. Evaluate results and check for statistical significance if necessary (for example, using a chi-square test).
  9. Draw learnings from the A/B test and its results and iterate further if necessary.