A very common scenario: A business runs tens and tens of A/B tests over the course of a year, and many of them “win.” Some tests get you 25% uplift in revenue, or even higher.
Yet when you roll out the change, the revenue doesn’t increase 25%. And 12 months after running all those tests, the conversion rate is still pretty much the same. How come?
The answer is this: Your uplifts were imaginary. There was no uplift to begin with.
Yes, your testing tool said you had a 95% statistical significance level (or higher). Well, that doesn’t mean much. Statistical significance and validity are not the same.
Table of contents
- Statistical significance is not a stopping rule.
- Run your tests longer.
- The stopping rules
Statistical significance is not a stopping rule.
When your testing tool says that you’ve reached a 95% or even 99% confidence level, that doesn’t mean that you have a winning variation.
Here’s an example I’ve used before. Two days after starting a test, these were the results:
The variation I built was losing badly—by more than 89% (and no overlap in the margin of error). It says here that Variation 1 has a 0% chance to beat the Control.
So, a 100% significant test and an 852.8% uplift (or, rather, the Control is more than 800% better than the treatment). Let’s end the test, shall we? Control wins! Or how about we give it some more time instead?
This is what it looked like 10 days later:
That’s right, the variation that had a 0% chance of beating control was now winning with 95% confidence. What’s up with that? How come “100% significance” and “0% chance of winning” became meaningless? Because they are.
If you end the test early, there’s a good chance that you’ll pick the wrong winner. In this scenario, many (most?) businesses still go ahead and implement the change (i.e., roll out the winning variation to 100% of traffic), while, in fact, the 800% lift becomes zero, or even negative (losing).
Even worse than the imaginary lift is the false confidence that you now have. You think you learned something and go on applying that learning elsewhere on the site. But the learning is actually invalid, thus rendering all your efforts and time a complete waste.
It’s the same with the second test screenshot (10 days in)—even though it says 95% significance, it’s still not “cooked.” The sample is too small; the absolute difference in conversions is just 19 transactions. That can change in a day.
Ton Wesseling has this to say about it:
You should know that stopping a test once it’s significant is deadly sin number 1 in A/B-testing land. Seventy-seven percent of A/A-tests (same page against same page) will reach significance at a certain point.
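The effect described in that quote can be reproduced with a small simulation. What follows is a minimal sketch, not Wesseling’s method: it runs A/A tests in which both arms share the same true conversion rate, checks significance at regular intervals, and compares how often a “winner” shows up at any check versus only at the planned end. The 10% conversion rate, 100 visitors per arm between checks, 20 checks, and the pooled two-proportion z-test are all assumptions chosen for illustration, so the percentages will not match the 77% figure above.

```python
import math
import random


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or only conversions): nothing to compare
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def run_aa_test(rng, rate=0.10, visitors_per_check=100, checks=20, alpha=0.05):
    """Simulate one A/A test (assumed parameters; both arms are identical).

    Returns (significant_at_any_check, significant_at_final_check).
    """
    conv_a = conv_b = visitors = 0
    significant_at_any_check = False
    p_value = 1.0
    for _ in range(checks):
        for _ in range(visitors_per_check):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
        visitors += visitors_per_check
        p_value = two_sided_p_value(conv_a, visitors, conv_b, visitors)
        if p_value < alpha:
            significant_at_any_check = True
    return significant_at_any_check, p_value < alpha


def false_positive_rates(runs=400, seed=1):
    """Share of A/A tests called 'significant' when peeking vs. waiting."""
    rng = random.Random(seed)
    results = [run_aa_test(rng) for _ in range(runs)]
    peeking = sum(any_check for any_check, _ in results) / runs
    waiting = sum(final for _, final in results) / runs
    return peeking, waiting


if __name__ == "__main__":
    peeking, waiting = false_positive_rates()
    print(f"Stop at first significant check: {peeking:.0%} false winners")
    print(f"Evaluate once at the end:        {waiting:.0%} false winners")
```

Because there is no real difference between the arms, every “winner” here is a false positive. Stopping at the first significant check can only flag as many or more false winners than a single evaluation at the end, and the more often you check, the wider that gap tends to get.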
Learn what significance really is.
Statistical significance alone shouldn’t determine whether you end a test or not. It’s not a stopping rule.
Statistical significance doesn’t tell us the probability that B is better than A. Nor does it tell us the probability that we will make a mistake in selecting B over A.
These are both extraordinarily common misconceptions, but they’re false. To learn what p-values are really about, read this post.
Run your tests longer.
If you stop your tests after a few days, you’re doing it wrong. It doesn’t matter if you get 10,000 transactions per day. The absolute number of transactions matters, but you also need pure time.
One of the difficulties with running tests online is that we are not in control of our user cohorts. This can be an issue if the users distribute differently by time and day of week, and even by season. Because of this, we probably want to make sure that we collect our data over any relevant data cycles. That way, our treatments are exposed to a more representative sample of the average user population.
Notice that segmentation doesn’t really get us out of this, since we still need to sample over the weekdays, weekends, etc. We probably also want to hit each day or day part a couple of times to average out any external events that could be affecting traffic flow or conversion. That gives us good estimates of the impact of time-based features/segments on conversion.
I see the following scenario all the time:
- First couple of days: B is winning big. Typically due to the novelty factor.
- After week #1: B winning strong.
- After week #2: B still winning, but the relative difference is smaller.
- After week #4: Regression to the mean—the uplift has disappeared.
So, if you stop the test before four weeks (maybe even after a few days), you think you have a winning variation, but you don’t. If you roll it out live, you have what I call an “imaginary lift.” You think you have a lift because your testing tool showed >25% growth, but you don’t see growth in your bank account.
Run your tests longer. Make sure they include two business cycles, have enough absolute conversions/transactions, and have had enough duration timewise.
Example of an imaginary lift
Here’s a test that we ran for an ecommerce client. The test duration was 35 days, targeted desktop visitors only, and had close to 3,000 transactions per variation.
Spoiler: The test ended with “no difference.” Here’s the Optimizely overview for revenue:
Let’s see now:
- First couple of days: Blue (variation #3) is winning big—like $16 per visitor vs. $12.5 for Control (#Winning!). Many people end the test here (#Fail!).
- After 7 days: Blue still winning, and the relative difference is big.
- After 14 days: Orange (#4) is winning!
- After 21 days: Orange still winning!
- End: No difference.
So, had you run the test for less than four weeks, you would’ve called the winner wrong.
The stopping rules
So, when is a test cooked?
Alas, there is no universal heavenly answer out there, and there are a lot of “it depends” factors. That said, you can have some pretty good stopping rules for most cases.
Here are my stopping rules:
- Test duration of at least 3 weeks (better if 4).
- Minimum pre-calculated sample size reached (using different tools). I will not believe any test that has fewer than 250–400 conversions per variation.
- Statistical significance of at least 95%.
It might be different for some tests because of peculiarities, but in most cases, I adhere to those rules.
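The three rules above can be written down as a simple checklist. This is only a sketch: the `ExperimentStatus` fields and the `unmet_stopping_rules` helper are hypothetical names, the inputs are whatever your testing tool reports, and the defaults use the lower bounds from the list (21 days, 250 conversions, 95%).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExperimentStatus:
    days_running: int
    conversions_per_variation: List[int]  # one count per variation, control included
    significance: float                   # as reported by the testing tool, e.g. 0.96
    planned_sample_reached: bool          # pre-calculated sample size reached?


def unmet_stopping_rules(status, min_days=21, min_conversions=250,
                         min_significance=0.95):
    """Return the stopping rules that the test does not satisfy yet."""
    unmet = []
    if status.days_running < min_days:
        unmet.append(
            f"duration: {status.days_running} days, need at least {min_days}")
    fewest = min(status.conversions_per_variation)
    if not status.planned_sample_reached or fewest < min_conversions:
        unmet.append(f"sample: smallest variation has {fewest} conversions")
    if status.significance < min_significance:
        unmet.append(
            f"significance: {status.significance:.0%} is below {min_significance:.0%}")
    return unmet


# Made-up numbers: a test that looks "99% significant" after two days.
two_days_in = ExperimentStatus(2, [12, 31], 0.99, False)
for rule in unmet_stopping_rules(two_days_in):
    print(rule)
```

An empty list means all three rules are met. In the made-up example, the test passes the significance rule but still fails on duration and sample size, which is exactly the situation this article warns about.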
Here’s Wesseling chiming in again:
You want to test as long as possible—at least one purchase cycle—the more data, the higher the statistical power of your test!
More traffic means you have a higher chance of recognizing your winner on the significance level you’re testing on! Small changes can make a big impact, but big impacts don’t happen too often—most of the time, your variation is slightly better, so you need a lot of data to notice a significant winner.
But if your tests last and last, people tend to delete their cookies (10% in two weeks). When they return in your test, they can end up in the wrong variation—so, when the weeks pass, your samples pollute more and more and end up with the same conversion rates.
Test for a maximum of 4 weeks.
What if—after three or four weeks—the sample size is less than 400 conversions per variation?
I let the test run longer. If the sample size isn’t reached by the four-week mark, I add another week.
Always test full weeks. If you start the test on a Monday, it should end on a Sunday. If you don’t test a full week at a time, you might skew your results.
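To plan this up front, you can estimate the required sample size and then round the duration up to whole weeks. The sketch below assumes a standard two-sided, two-proportion z-test approximation with 80% power; the function names, the baseline rate, the lift, and the traffic figure are illustrative assumptions, and your testing tool’s own calculator may use a different formula.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_lift,
                              alpha=0.05, power=0.80):
    """Approximate visitors needed per variation (two-proportion z-test)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def planned_duration_days(visitors_per_variation, daily_visitors,
                          variations=2, min_weeks=3):
    """Days to run, rounded up to full weeks and never below min_weeks."""
    days_needed = math.ceil(visitors_per_variation * variations / daily_visitors)
    weeks = max(min_weeks, math.ceil(days_needed / 7))
    return weeks * 7


# Assumed inputs: 3% baseline conversion rate, hoping to detect a 10% relative
# lift, with 4,000 visitors per day entering the test.
needed = sample_size_per_variation(0.03, 0.10)
print(needed, "visitors per variation")
print(planned_duration_days(needed, daily_visitors=4000), "days")
```

Rounding up to a multiple of seven keeps every weekday equally represented, and `min_weeks=3` mirrors the minimum duration from the stopping rules.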
Run a conversions-per-day-of-the-week report on your site. See how much fluctuation there is.
What do you see below? Thursdays make 2X more money than Saturdays and Sundays, and the conversion rate on Thursdays is almost 2X better than on a Saturday.
If we didn’t test for full weeks, the results would be inaccurate. Test full weeks at a time.
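If your analytics tool lacks such a report, a daily export is enough to build one. Here is a minimal sketch that assumes rows of `(date, visitors, conversions)`; the function name and the sample figures are made up for illustration.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def conversion_rate_by_weekday(daily_rows):
    """Aggregate (date, visitors, conversions) rows into a rate per weekday."""
    totals = defaultdict(lambda: [0, 0])  # weekday -> [visitors, conversions]
    for day, visitors, conversions in daily_rows:
        name = WEEKDAYS[day.weekday()]
        totals[name][0] += visitors
        totals[name][1] += conversions
    return {name: conv / vis for name, (vis, conv) in totals.items() if vis}


# Made-up daily export covering two Thursdays and two Saturdays.
rows = [
    (date(2024, 1, 4), 1000, 40),   # Thursday
    (date(2024, 1, 6), 800, 15),    # Saturday
    (date(2024, 1, 11), 1000, 44),  # Thursday
    (date(2024, 1, 13), 800, 17),   # Saturday
]
for weekday, rate in conversion_rate_by_weekday(rows).items():
    print(f"{weekday}: {rate:.1%}")
```

If the rates differ as much between weekdays as in the report above, a test that covers an uneven mix of days will be skewed toward whichever days are overrepresented.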
The same stopping rules apply for each segment.
Segmenting is key to learning from A/B tests. It’s common for B to lose to A in overall results but beat A in certain segments (e.g., Facebook traffic, mobile device users, etc.).
Before you can analyze any segmented data, make sure that you have a large-enough sample size within each segment. So, you need 250–400 conversions per variation within each segment you’re looking at.
I even recommend that you create targeted tests (set target audience/segment in the test configuration) instead of analyzing the results across segments after a test. This ensures that tests aren’t called early, and each segment has adequate sample size.
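When you do analyze segments after the fact, it helps to filter out the ones that are too small before looking at any uplift. A minimal sketch, assuming you can export conversions per variation for each segment; the function name, the data layout, and the numbers are hypothetical.

```python
def segments_ready_for_analysis(segment_conversions, min_conversions=250):
    """Map each segment to True if every variation has enough conversions.

    segment_conversions: {segment_name: {variation_name: conversions}}
    """
    return {
        segment: min(counts.values()) >= min_conversions
        for segment, counts in segment_conversions.items()
    }


# Made-up export: only the desktop segment clears the 250-conversion floor.
export = {
    "desktop": {"control": 1420, "variation": 1488},
    "mobile": {"control": 310, "variation": 240},
    "facebook": {"control": 85, "variation": 97},
}
for segment, ready in segments_ready_for_analysis(export).items():
    print(segment, "ok to analyze" if ready else "not enough data yet")
```

The check uses the smallest variation in each segment, so one well-filled arm cannot hide an underpowered one.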
My friend André Morys said this about his stopping rules:
I always tell people that you need a representative sample if your data is to be valid.
What does “representative” mean? First of all, you need to include all the weekdays and weekends. You need different weather because it impacts buyer behavior. But most importantly, your traffic needs to have all traffic sources, especially newsletter, special campaigns, TV…everything! The longer the test runs, the more insights you get.
We just ran a test for a big fashion retailer in the middle of the summer sale phase. It was very interesting to see how the results dramatically dropped during the “hard sale phase” with 70% and more—but recovered one week after the phase ended. We would never have learned this if the test hadn’t run for nearly four weeks.
Our “rule of thumb” is this: 3,000–4,000 conversions per variation and 3–4 week test duration. That is enough traffic so we can even talk about valid data if we drill down into segments.
Testing “sin” number 1: Searching for uplifts within segments when you have no statistical validity (e.g., 85 vs. 97 conversions). That’s bullshit.
Learning from tests is super important—even more than getting wins. And segmenting your test data is one of the better ways to learn. Just make sure your segments have enough data before you jump to conclusions.
Don’t stop a test just because it hits a 95% significance level or higher. Pay attention to the absolute number of conversions per variation and the test duration as well.