The A/B Testing Paradox


Has this ever happened to you:

  • You ran 30 A/B tests over 6 months.
  • 20 of them showed an average lift of 2%, and you promoted each winning variant to your entire traffic base. Yay! The other 10 showed no lift over the base configuration, and you killed those variants. No big deal, of course.
  • 6 months later, you would expect your business to be up ~40% from the 20 winning variants.
  • Yet, overall revenue has hardly improved.

Maybe seasonality is the issue (i.e., revenue improved over the base, but at this time of year seasonality leads to a ~40% decline anyway)? Or maybe there were other factors that would have driven the business down 40%, and your A/B tests neutralized them?

While possibilities like these could be true, an often overlooked fact is that our understanding of how to measure “lift” through A/B testing may be flawed. Here are some ways in which A/B testing might flatter only to deceive, some of which can be addressed by improving your testing methodology:

    1. You never bothered to understand the extent of random fluctuations in metrics: Consider an experiment where you toss a coin 100 times, twice. Through sheer randomness, you are unlikely to get exactly 50 heads both times – you might get 49 heads once, and 51 heads the other time. A null test, which randomly slices users at all times and measures the variance between the two slices, gives you the variance in metrics that is purely random. Without accounting for this random variance, you can never measure the true lift caused by your experiment’s variations. It’s a good idea to create a dummy test (where the two variants are entirely the same) and run it at all times. When you run other A/B tests, make sure that they show a bigger observed deviation than the null test.
    2. The tests were not independent: One of the assumptions in an A/B testing system is that you can run multiple tests at the same time, and that those tests are independent of each other. This assumption often breaks down when you run tests that apply only to a sub-segment of users. For example, consider a test X (with variants XA and XB) that triggers only on the segment of users who have clicked the “buy” button on an e-commerce site, running alongside a test Y (with variants YA and YB) where YB users show a higher propensity to click the buy button. As a result, the population that sees test X is now composed of more YB users, breaking the independence. This is a really hard pitfall to avoid in A/B testing, but it can be resolved through multivariate testing. One way to minimize its impact is to run only a few tests in parallel, and to be careful when running “filtered” tests (tests that are triggered only on a small segment of qualifying users).
    3. You didn’t run the test long enough before you declared victory: A test needs a sufficient number of observations before you can tell whether the measured difference is statistically significant. So how long do you need to run a test to be confident in its results? It depends on the following:
        1. The current value of the metric you’re looking to optimize: e.g., if the metric is conversion rate, you will need to run the test for more time when the current value is 5% than when it is 3%. The reason is that the random variance of a conversion rate is larger near 5% than near 3%, so the same 1% improvement has a higher chance of being “noise” against the larger baseline.
        2. The minimum difference you hope to detect: If you want to detect a statistically significant lift of 0.1%, you will need to run the test for much longer than if you wanted to detect a statistically significant lift of 1%.
        3. How much you want to minimize the risk of false positives and false negatives: If you hope to reduce false positives (in other words, increase the statistical significance), you will need to run the test for longer. Similarly, to reduce the risk of false negatives (that is, to increase the statistical power), you will also need to run the test for longer.
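
The null-test idea in pitfall 1 can be sketched as a quick simulation: both slices get the identical experience, yet the measured “lift” between them is rarely zero. The 5% conversion rate and sample sizes below are illustrative assumptions, not numbers from the text:

```python
import random

random.seed(42)

def simulate_null_test(n_users=10_000, conversion_rate=0.05):
    """Run one A/A (null) test: both slices see the same experience,
    so any measured 'lift' is pure random fluctuation."""
    conversions = [0, 0]
    visitors = [0, 0]
    for _ in range(n_users):
        slice_id = random.randint(0, 1)        # random 50/50 split
        visitors[slice_id] += 1
        if random.random() < conversion_rate:  # identical rate for both slices
            conversions[slice_id] += 1
    rate_a = conversions[0] / visitors[0]
    rate_b = conversions[1] / visitors[1]
    return (rate_b - rate_a) / rate_a          # relative "lift" of B over A

# Repeat the null test many times to see the purely random spread.
lifts = [simulate_null_test() for _ in range(1000)]
abs_lifts = sorted(abs(l) for l in lifts)
print(f"median |null lift|: {abs_lifts[500]:.1%}")
print(f"95% of null lifts fall within ±{abs_lifts[950]:.1%}")
```

A real A/B test’s observed lift only means something once it clears the band this null test produces.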
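
The broken independence in pitfall 2 is easy to see numerically. In this hypothetical setup, variant YB doubles the chance a user clicks “buy”, and test X only triggers on buyers; the click rates below are illustrative assumptions:

```python
import random

random.seed(0)

# Hypothetical setup: variant YB doubles the chance a user clicks "buy";
# test X only triggers on users who clicked "buy".
P_CLICK = {"YA": 0.10, "YB": 0.20}

buyers = {"YA": 0, "YB": 0}
for _ in range(100_000):
    y_variant = random.choice(["YA", "YB"])   # independent 50/50 assignment
    if random.random() < P_CLICK[y_variant]:  # did the user click "buy"?
        buyers[y_variant] += 1                # ...then they enter test X

share_yb = buyers["YB"] / (buyers["YA"] + buyers["YB"])
print(f"Share of YB users inside test X: {share_yb:.1%}")  # ~67%, not 50%
```

Even though test Y assigned users 50/50, test X’s population is roughly two-thirds YB users, so any “lift” measured in X is entangled with Y.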

A sample size calculator makes it easy to figure out how many observations you need.
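The three factors above come together in the standard two-proportion sample-size formula, which such calculators implement. A minimal sketch (the baseline rates and lifts below are illustrative assumptions):

```python
from statistics import NormalDist

def sample_size_per_variant(p_base, min_lift, alpha=0.05, power=0.8):
    """Observations needed per variant to detect an absolute lift of
    `min_lift` over baseline rate `p_base`, at two-sided significance
    level `alpha` and statistical power `power`."""
    p_var = p_base + min_lift
    p_bar = (p_base + p_var) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # controls false positives
    z_beta = NormalDist().inv_cdf(power)           # controls false negatives
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p_base * (1 - p_base) + p_var * (1 - p_var)) ** 0.5)
    return int(numerator ** 2 / min_lift ** 2) + 1

# Each factor from the list above pushes the required sample size around:
print(sample_size_per_variant(0.05, 0.01))   # 5% baseline, 1% absolute lift
print(sample_size_per_variant(0.03, 0.01))   # lower baseline -> fewer users needed
print(sample_size_per_variant(0.05, 0.001))  # smaller lift -> far more users needed
```

Note how a smaller minimum detectable lift inflates the required sample size quadratically, while a lower baseline rate (with its smaller variance) shrinks it.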

As you internalize some of these observations into your testing practices, you will find fewer tests “succeeding”, but the ones that do succeed will produce sustainable results. Happy testing!