Just when you start to think that A/B testing is fairly straightforward, you run into a new strategic controversy.
This one is polarizing: how many variations should you test against the control?
There are many different opinions on this one, some completely opposite. Some of it comes down to strategy, some of it to mathematics. Some of it may depend on the stage of business you are in or the sophistication of your program.
No matter what the case though, it’s not really a straightforward, easy answer. Let’s start with the easy stuff: the math.
The Multiple Comparisons Problem
When you test many variations at the same time, you run into what is known as “cumulative alpha error.”
Basically, the more test variants you run, the higher the probability of false positives.
Put it this way: if you’re making decisions at a 95% confidence level (a significance level of 0.05), there is a 5% probability of a type 1 error (“alpha error,” or a false positive). That means that in 5% of the cases where there is in reality no effect at all, you will still conclude that there is a significant one.
This accumulating factor is one argument against the efficiency of the 41 Shades of Blue test from Google (though I’m sure they corrected for this error). Here’s a great visual from konversionsKRAFT to illustrate the increasing risk:
The way you can calculate the cumulative alpha is:
Cumulative alpha = 1-(1-Alpha)^k
Alpha = selected significance level, as a rule 0.05
k = number of test variants in your test (without the control)
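As a quick illustration of the formula above, here is a minimal Python sketch. The function name `cumulative_alpha` is made up for this example, and the formula treats the k comparisons as independent, which is only an approximation when every variant is compared against the same control.

```python
def cumulative_alpha(alpha: float, k: int) -> float:
    """Chance of at least one false positive across k comparisons.

    alpha: significance level used for each single comparison (e.g. 0.05)
    k:     number of test variants, not counting the control
    """
    return 1 - (1 - alpha) ** k


# Print the cumulative alpha for a few variant counts at alpha = 0.05.
for k in (1, 2, 3, 5, 10, 41):
    print(f"{k:>2} variants -> cumulative alpha {cumulative_alpha(0.05, k):.3f}")
```

With alpha at 0.05, this gives roughly 14% for three variants and roughly 40% for ten.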
So you can see your risk of a false positive increases drastically with each new variation. It should be obvious, then – only test one variation, right?
Well, not really. Most tools, including Optimizely, VWO, and Conductrics, have built-in procedures for correcting what is known as the Multiple Comparisons Problem. They may use different techniques, but they all address the problem.
And even if your testing tool doesn’t have a correction procedure built in, you can still correct the alpha error yourself. There are many different techniques available, and I’m not an expert on the trade-offs between them (maybe an actual statistician can chime in here):
Keep in mind that adjusting the alpha is a trade-off: while you’re decreasing the risk of Type I errors, you’re increasing the risk of Type II errors (not seeing a difference when there actually is one).
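To make "correcting the alpha yourself" concrete, here is a rough sketch of three widely used corrections: Bonferroni, Šidák, and Holm's step-down procedure. These are picked only as illustrations; they are not necessarily what any particular tool uses, and the p-values below are invented.

```python
def bonferroni_alpha(alpha: float, k: int) -> float:
    """Per-comparison threshold under Bonferroni: alpha divided by k."""
    return alpha / k


def sidak_alpha(alpha: float, k: int) -> float:
    """Per-comparison threshold under Sidak (assumes independent comparisons)."""
    return 1 - (1 - alpha) ** (1 / k)


def holm_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm step-down: for each p-value, is that comparison significant?"""
    k = len(p_values)
    # Visit p-values from smallest to largest, remembering original positions.
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    for rank, i in enumerate(order):
        # The threshold loosens step by step: alpha/k, alpha/(k-1), ...
        if p_values[i] <= alpha / (k - rank):
            reject[i] = True
        else:
            break  # the first failure stops the procedure
    return reject


# Hypothetical p-values for four variants, each compared with the control.
p_values = [0.004, 0.015, 0.03, 0.2]
threshold = bonferroni_alpha(0.05, len(p_values))
print("Bonferroni threshold:", threshold)
print("Sidak threshold:", sidak_alpha(0.05, len(p_values)))
print("Bonferroni significant:", [p <= threshold for p in p_values])
print("Holm significant:", holm_reject(p_values))
```

In this made-up example Bonferroni flags only the first variant while Holm also flags the second. That is the Type I versus Type II trade-off in miniature: the stricter the correction, the more real differences you risk missing.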
In addition, Andrew Gelman wrote a great paper that states that the problem of multiple comparisons could disappear entirely when viewed from a hierarchical Bayesian perspective.
Idan Michaeli, Chief Data Scientist at Dynamic Yield, also notes that taking a Bayesian approach remedies this problem:
As Matt Gershoff, CEO of Conductrics, put it, this assumes you have a strong prior that the variations are indeed the same – all of that really leads into partial pooling of data, which Matt wrote about in this great post.
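For a feel of what partial pooling does, here is a deliberately simplified sketch. It is not Gelman's hierarchical model and not how Conductrics implements it. It shrinks each variant's observed conversion rate toward the pooled rate using a Beta prior whose weight (`prior_strength`, an arbitrary number chosen for illustration) stands in for how strongly you believe the variations are the same. A full hierarchical model would estimate that weight from the data.

```python
def partially_pooled_rates(conversions, visitors, prior_strength=200.0):
    """Shrink each variant's conversion rate toward the pooled rate.

    Uses a Beta(a, b) prior centered on the pooled rate, with
    a + b = prior_strength acting like that many "pseudo-visitors".
    Returns the posterior mean rate for each variant.
    """
    pooled_rate = sum(conversions) / sum(visitors)
    a = prior_strength * pooled_rate
    b = prior_strength * (1 - pooled_rate)
    return [(c + a) / (n + a + b) for c, n in zip(conversions, visitors)]


# Invented data: control plus two variants, 1,000 visitors each.
conversions = [50, 65, 40]
visitors = [1000, 1000, 1000]

raw = [c / n for c, n in zip(conversions, visitors)]
shrunk = partially_pooled_rates(conversions, visitors)
for r, s in zip(raw, shrunk):
    print(f"raw {r:.4f} -> partially pooled {s:.4f}")
```

The extreme rates get pulled toward the middle. This is why a strong prior that the variations are the same makes spurious "winners" less likely.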
If you’re still afraid of the mathematical implications of comparing multiple means, note that you’re really doing the same thing when you’re doing post-test segmentation of your data. Chris Stucchio from VWO wrote a great article on that:
In conclusion, if you’re working with the right tool or have decent analysts, the math isn’t really the problem. The math is hard, but it’s not impossible or dangerous. As Matt Gershoff aptly put it, “the main point is not to get too hung up on which [correction] approach, just that it is done.”
Also, h/t to Matt for helping me get all of the math right here.
So, disregarding the mathematical angle, we’re left with a strategic decision. Where’s the ROI: in testing as many variations as possible, or in limiting the scope and maybe moving more quickly to the next test?
The Case for Maximizing the Number of Variants
While most people don’t have that kind of traffic, the point remains: this is data-driven decision making. Devoid of opinion, devoid of style.
Now, accounting for traffic realities (you can’t test like Google does), is testing many variations at once the right style for you? Some say so.
This approach runs in stark contrast to what many experts advise. Not only do many people advise you to only test one element at a time (bad advice), most people say you should stick to a simple A vs B test.
So, naturally, I reached out to Andrew for some clarity. After all, his approach also seems to work for larger companies like Microsoft, Amazon, and of course, Google. Does it work for companies with less traffic, too? How applicable is the approach?
Here’s what he said:
What’s the point? Efficiency. You test this many variations, and you limit the opinions that hold back a testing program. It’s also (in my mind, these aren’t Andrew’s thoughts) sort of like how The Onion forces writers to crank out 20 headlines per article. The first few are easy, but by the last 5, you’re really pushing the boundaries and throwing away assumptions. Test lots of shit and you’re bound to get some solutions you never would have thought of otherwise.
Andrew isn’t the only one advocating for testing multiple variations, of course. Idan Michaeli from Dynamic Yield said it’s tough to put a limit on the number of variations you test. He, too, mentioned that the difference between the variations is a crucial factor, no matter how many variations you’re running.
“The more substantial the difference in appearance, the faster you can detect the difference in performance in a statistically significant way,” he said.
More often than not, though, the number of variations is an “it depends” type of answer. The individual factors you’re dealing with matter much more than a set-in-stone strategy.
The Case for Minimizing the Number of Variations
There are many people who advocate testing fewer variations as opposed to many. Some for the mathematical reasons we discussed above, some as a means of optimization strategy.
One point: with alpha adjustments, it will almost always take longer to run a test with more variations. You may be operating on a strongly iterative approach, where you’re exploring user behavior on a granular level, and you only test one or a few variations at a time. Or perhaps your organization has not been testing for long, and you want to demonstrate a few quick wins without getting into the nitty-gritty of ANOVA and alpha error inflation.
So, you can test adding a value proposition against your current (lack of a) value proposition. You get a quick win and can move on to increasing your testing velocity and program’s efficiency and support.
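To put rough numbers on the point that alpha adjustments make multi-variant tests run longer, here is a back-of-the-envelope sketch. It uses the standard normal-approximation sample size formula for comparing two proportions, with a Bonferroni-adjusted alpha. The 5% baseline conversion rate, the 10% relative lift, and the 80% power are assumptions chosen for illustration, and `visitors_per_arm` is a made-up helper name.

```python
import math
from statistics import NormalDist


def visitors_per_arm(base_rate, relative_lift, alpha, power=0.8):
    """Approximate visitors needed per arm for a two-sided two-proportion test."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# k = number of variations, not counting the control.
for k in (1, 2, 4, 8):
    adjusted_alpha = 0.05 / k  # Bonferroni correction
    per_arm = visitors_per_arm(0.05, 0.10, adjusted_alpha)
    total = per_arm * (k + 1)  # variations plus the control
    print(f"{k} variation(s): {per_arm:,} per arm, {total:,} in total")
```

The total grows for two reasons at once: there are more arms to fill, and each arm needs more visitors because the per-comparison alpha is stricter.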
There are some other reasons, too, that people have mentioned in favor of reducing the number of variations.
And there’s the question of sample pollution as well, which occurs when a sample is not truly randomized, or users are exposed to multiple variations in a test.
If you’d like to read more on sample pollution in A/B testing, read our article on it.
Traffic and Time
Time and traffic are also a concern. How long does it take to create 10 drastically different variations versus just one? How much traffic do you have, and how long will it take you to pull off a valid test?
Here’s how Ton put it:
Ton also mentioned that running only one variation against the control is a good way to research buyer/user motivations – basically, to explore what’s working and what’s not – and then later to exploit that through other means like bandits:
There’s a Middle Ground, Too
As she put it, “I don’t think it’s possible to give a general answer. The specific test setup depends on a number of factors (see below). From my personal experience and opinion, I would never test more than five variations (including control) at the same time.”
Idan Michaeli, too, believes that it depends on a variety of factors and there isn’t a silver bullet answer:
Under the premise that there isn’t a black and white answer, how do you decide how many variations to test? Even if you believe in maximizing the variations, how do you decide how many is optimal?
What factors determine how many variations you put against the control?
It may not be smart to advise the diverse audience reading this to either test 41 shades of a color or just stick to one variation. Just as your audiences, conversions, revenue, traffic, etc. are different, so are your company structures, politics, and processes. A one size fits all answer isn’t really possible.
There are some factors to help you home in on an accurate approach, though.
According to Ton, you look at the usual factors when determining experimental design:
Dr. Julia Engelmann gives her criteria, mainly from a statistics perspective:
And as Andrew was quoted saying earlier in the article, he runs fragility models to find the sweet spot in a given context. According to him, “Even in the highest trafficked sites (and I have worked on 16 of the top 200 sites out there), the sweet spot is usually still in the 12-16 range.”
As for finding areas of opportunity and elements of impact, Andrew wrote that he has a series of different types of tests designed to maximize learning, such as MVTs, existence testing, and personalization. When he homes in on areas of impact, he tries to maximize the beta of options and, for a given solution, attempts to also test the opposite of that (which will be written about in a coming article).
Account for your resources
In addition to traffic, you have to account for your individual resources and organizational efficiency. How much time does it take your design and dev teams to build a series of huge changes vs. an incremental test (41 shades of blue style)? The former takes a lot, the latter almost none.
Ton, first, advises, “Please don’t do the button colour thingie, you want to learn what drives behavior and how you can motivate users to take the next step. Then again with front-end development resources like testing.agency also bolder experiments don’t have to cost the world – and low on resources can’t hold you back anymore.”
Basically, smaller changes (button color) take almost zero resources, so it’s easier to test many variations of them. But because they’re minute changes that don’t fundamentally affect user behavior, they’re also less likely to show a large effect.
On the other side, radical changes take more resources, but you’re more likely to see an effect. And when you pit several radical changes against each other, you’re more likely to see the optimal (or closer to the optimal) experience.
Andrew put that well in his CXL article, “If I have 5 dollars and I can get 10, great, but if I can get 50, or 100, or a 1000, then I need to know that, and the only way I do that is through discovery and exploitation of feasible alternatives.”
As much of a bummer as this is for a tl;dr, there isn’t a black and white answer. And I don’t have a horse in this race; I’m a fan of whichever gets the best results. It depends on your traffic, conversions, audience, and company culture and process.
However, the math is, generally speaking, not a limiting factor. Instead, you should choose based on the factors listed above. In favor of more variations, you’re avoiding limitations on ideas because of what you think will (or won’t) work. If the differences between the options are big, you’re also much more likely to detect a clear winner.
Limiting your variations has to do with sample pollution, traffic, and time/resource concerns.
Finally, the same organization can run both types of tests. It’s a strategic decision, not necessarily one I can make for you.