Testing tools are getting more sophisticated. Blogs are brimming with “inspiring” case studies. Experimentation is becoming more and more common for marketers. Statistical know-how, however, lags behind.
This post is filled with clear explanations of A/B testing statistics from top CRO experts. A/B testing statistics aren’t that complicated—but they are that essential to running tests correctly.
Here’s what we’ll cover (feel free to jump ahead):
- Mean, variance, and sampling;
- Statistical significance;
- Statistical power;
- Confidence intervals and margin of errors;
- Regression to the mean;
- Confounding variables and external factors.
And just in case you’re uncertain about why A/B testing statistics are so essential…
Why do I need to know A/B testing statistics?
Statistics aren’t necessarily fun to learn. It’s probably more fun to put up a test between a red and green button and wait until your testing tool tells you one of them has beaten the other.
If this is your strategy, you’re ripe for disappointment. This approach isn’t much better than guessing. Often, it ends with a year’s worth of testing but the exact same conversion rate as when you started.
Statistics help you interpret results and make practical business decisions. A lack of understanding of A/B testing statistics can lead to errors and unreliable outcomes.
Here’s an analogy from Matt:
And so it is with conversion rates. Conversion optimization is a balancing act between exploration and exploitation. It’s about balancing risk, which is a fundamental problem solved by statistics. As Ton Wesseling put it:
Knowing statistics will make you a better marketer. Learning these eight core aspects of A/B testing statistics will help you increase your conversion rates and revenue.
1. Mean, variance, and sampling
There are three terms you should know before we dive into the nitty-gritty of A/B testing statistics:
The mean is the average. For conversion rates, it’s the number of events multiplied by the probability of success (n*p).
In our coffee example, this would be the process of measuring the temperature of each cup of coffee that we sample, then dividing by the total number of cups. The average temperate should be representative of the actual average.
In online experimentation, since we can’t know the “true” conversion rate, we’re measuring the mean conversion rate of each variation.
Variance is the average variability of our data. The higher the variability, the less precise the mean will be as a predictor of an individual data point.
It’s basically, on average, how far off individual cups of coffee are from the collection’s average temperature. In other words, how close will the mean be to each cup’s actual temperature? The smaller the variance, the more accurate the mean will be as a guess for each cup’s temperature.
Many things can cause variance (e.g. how long ago the coffee was poured, who made it, how hot the water was, etc.). In terms of conversion optimization, Marketing Experiments gives a great example of variance:
The two images above are the exact same—except that the treatment earned 15% more conversions. This is an A/A test.
A/A tests, which are often used to detect whether your testing software is working, are also used to detect natural variability. It splits traffic between two identical pages. If you discover a statistically significant lift on one variation, you need to investigate the cause.
Since we can’t measure the “true conversion rate,” we have to select a sample that’s statistically representative of the whole.
In our coffee example, we don’t know the mean temperature of coffee from each restaurant. Therefore, we need to collect data on the temperature to estimate the average temperature. So, unlike comparing individual cups of coffee, we don’t measure all possible cups of coffee from McDonald’s and Starbucks. We use some of them to estimate the total.
The more cups we measure, the more likely it is that the sample is representative of the actual temperature. The variance shrinks with a larger sample size, and it’s more likely that our mean will be accurate.
Similarly, in conversion optimization, the larger the sample size, in general, the more accurate your test will be.
2. Statistical significance
Let’s start with the obvious question: What is statistical significance? As Evan Miller explains:
When an A/B testing dashboard says there is a “95% chance of beating original” or “90% probability of statistical significance,” it’s asking the following question: Assuming there is no underlying difference between A and B, how often will we see a difference like we do in the data just by chance?
Statistical significance is a major quantifier in null-hypothesis statistical testing. Simply put, a low significance level means that there’s a big chance that your “winner” is not a real winner. Insignificant results carry a larger risk of false positives (known as Type I errors).
If you don’t predetermine a sample size for your test and a stopping point (when the test will end), you’re likely get inaccurate results. Why? Because most A/B testing tools do not wait for a fixed horizon (a set point in time) to call statistical significance.
Most A/B tests oscillate between significant and insignificant at many points throughout the experiment:
Here’s an example we’ve given before. Two days after a test started, here were the results:
The variation clearly lost, right? It had a 0% chance to beat the original. Not so fast. Was it “statistically significant” according to the tool? Yes. But check out the results 10 days later:
That’s why you shouldn’t peek at results. The more you peek at the results, the more you risk what’s called “alpha error inflation” (read about it here). Set a sample size and a fixed horizon, and don’t stop the test until then.
Also, be wary of case studies that claim statistical significance yet don’t publish full numbers. Many may be “statistically significant” but have a tiny sample size (e.g. 100 users).
If you do some follow-up reading on statistical significance, you’ll likely come across the term “p-value.” The p-value is a measure of evidence against the null hypothesis (the control in A/B testing parlance).
Matt Gershoff gave a great example and explanation in a previous article:
Formally, the p-value is the probability of seeing a particular result (or greater one) from zero, assuming that the null hypothesis is true. If “null hypothesis is true” is confusing, replace it with, “assuming we had really run an A/A test.”
If our test statistic is in the “surprising” region, we reject the null (reject that it was really an A/A test). If the result is within the “not surprising” area, then we fail to reject the null. That’s it.
What you actually need to know about P-Values
The p-value does not tell you the probability that B is better than A. Similarly, it doesn’t tell you the probability that you’ll make a mistake if you implement B instead of A. These are common misconceptions.
Remember, the p-value is just the probability of seeing a result (or more extreme one) given that the null hypothesis is true. Or, “How surprising is this result?”
Small note: There’s a large debate in the scientific community about p-values. This comes primarily from the controversial practice of “p-hacking” to manipulate the results of an experiment until it reaches significance (so the author can get published).
4. Statistical Power
While statistical significance is the term you’ll hear most often, many people forget about statistical power. While significance is the probability of seeing an effect when none exists, power is the probability of seeing an effect when it does actually exist—the sensitivity of your test.
When you have low power levels, there’s a bigger chance that you’ll “miss” a real winner. Evan Miller put together a great chart to explain the differences:
Statistical power is the likelihood that a study will detect an effect when there is an effect there to be detected. If statistical power is high, the probability of making a Type II error, or concluding there is no effect when, in fact, there is one, goes down.
Four main factors affect the power of any test for statistical significance:
- Effect size;
- Sample size (n);
- Alpha significance criterion (α);
- Statistical power, or the chosen or implied beta (β).
For practical purposes, all you really need to know is that 80% power is the standard for testing tools. To reach that level, you need either a large sample size, a large effect size, or a longer duration test.
As Wesseling notes:
One caveat: If your test lasts too long, you risk sample pollution. Read this post to learn more.
5. Confidence intervals and margin of errors
Since statistics is inferential, we use confidence intervals to mitigate the risk of sampling errors. In that sense, we’re managing the risk associated with implementing a new variation. So if your tool says something like, “We are 95% confident that the conversion rate is X% +/- Y%,” then you need to account for the +/- Y% as the margin of error.
One practical application is to watch if confidence intervals overlap. As Michael Aagaard puts it:
So, the conversion range can be described as the margin of error you’re willing to accept. The smaller the conversion range, the more accurate your results will be. As a rule of thumb, if the two conversion ranges overlap, you’ll need to keep testing in order to get a valid result.
John Quarto has a great visual explaining confidence intervals:
Confidence intervals shrink as you collect more data, but at a certain point, there’s a law of diminishing returns.
Reading right to left, as we increase our sample size, our sampling error falls. However, it falls at a decreasing rate, which means that we get less and less information from each addition to our sample.
Now, if you were to do further research on the subject, you might be confused by the interchangeability of the terms “confidence interval” and “margin of error”. For all practical purposes, here’s the difference: The confidence interval is what you see on your testing tool as “20% +/- 2%,” and the margin of error is “+/- 2%.”
Matt Gershoff gave an illustrative example:
6. Regression to the mean
A common question one might have when first testing is, “What is the reason for the wild fluctuations at the beginning of the test?” Here’s what I mean:
What’s happening is a regression to the mean. A regression to the mean is “the phenomenon that if a variable is extreme on its first measurement, it will tend to be closer to the average on its second measurement.”
A great example comes from Wikipedia:
Imagine you give a class of students a 100-item true/false test on a subject. Suppose all the students choose their answers randomly. Then, each student’s score would be a realization of independent and identically distributed random variables, with an expected mean of 50. Of course, some students would score much above 50 and some much below.
So say you take only the top 10% of students and give them a second test where they, again, guess randomly on all questions. Since the mean would still be expected to be near 50, it’s expected that the students’ scores would regress to the mean—their scores would go down and be closer to the mean.
In A/B testing, it can happen for a variety of reasons. If you’re calling a test early based only on reaching significance, it’s possible that you’re seeing a false positive. And it’s likely that your “winner” will regress to the mean.
A related topic is the novelty effect. That’s when the novelty of your changes (e.g. bigger blue button) brings more attention to the variation. With time, the lift disappears because the change is no longer novel.
Adobe outlined a method to distinguish the difference between a novelty effect and actual inferiority:
To determine if the new offer underperforms because of a novelty effect or because it’s truly inferior, you can segment your visitors into new and returning visitors and compare the conversion rates. If it’s just the novelty effect, the new offer will win with new visitors. Eventually, as returning visitors get accustomed to the new changes, the offer will win with them, too.
The key to learning in A/B testing is segmenting. Even though B might lose to A in the overall results, B might beat A in certain segments (e.g. organic, Facebook, mobile, etc). For segments, the same stopping rules apply.
Make sure that you have a large-enough sample size within each segment. Calculate it in advance; be wary if it’s less than 250–350 conversions per variation within each segment.
As André Morys said in a previous article, searching for lifts within segments that have no statistical validity is a big mistake:
You can learn a lot from segmenting your test data, but make sure you’re applying the same statistical rules to smaller data sets.
8. Confounding variables and external factors
There’s a challenge with running A/B tests: The data is “non-stationary.”
A stationary time series is one whose statistical properties (mean, variance, autocorrelation, etc.) are constant over time.
For many reasons, website data is non-stationary, which means we can’t make the same assumptions as with stationary data. Here are a few reasons data might fluctuate:
- Day of the week;
- Press (positive or negative);
There are many more, most of which reinforce the importance of testing for full weeks. You can see this for yourself. Run a conversions per Day of Week report on your site to see how much fluctuation there is:
You can see that Saturday’s conversion rate is much lower than Thursday’s. So if you started the test on a Friday and ended on a Sunday, you’d skew your results.
If you’re running a test during Christmas, your winning test might not be a winner by the time February comes. Again, this is product of web data being nonstationary. The fix? If you have tests that win over the holidays (or during promotions), run repeat tests during “normal” times.
External factors definitely affect test results. When in doubt, run a follow-up test (or look into bandit tests for short promotions).
Learning the underlying A/B testing statistics allows you to avoid mistakes in test planning, execution, and interpretation. Here are some testing heuristics:
- Test for full weeks.
- Test for two business cycles.
- Make sure your sample size is large enough (use a calculator before you start the test).
- Keep in mind confounding variables and external factors (holidays, etc.).
- Set a fixed horizon and sample size for your test before you run it.
- You can’t “see a trend.” Regression to the mean will occur. Wait until the test ends to call it.
Working on something related to this? Post a comment in the CXL community!