The title may seem a bit controversial, but a fairly common question I get from large (and small) companies is: “Should I run A/A tests to check whether my experiment is working?”
The answer might surprise you.
I’ve been doing split and multivariate tests since 2004, and have watched every single one like a hawk. I’ve personally made every testing mistake in the book and wasted many days along the way, all in order to get better at running valid tests.
What does my experience tell me here?
Table of contents
- There are better ways to use your precious testing time
- What kinds of A/A tests are there?
- The dirty secret of testing
- Triangulate your data
- Watch it like a chef
There are better ways to use your precious testing time
To be clear, I’m not saying that running an A/A test is wrong, just that my experience tells me there are better ways to use your time when testing. Just as there are many ways to lose weight, there are optimal ways to run your tests.
While the volume of tests you start is important, what matters most is how many you finish every month and how many of those teach you something useful.
Running A/A tests can eat into ‘real’ testing time.
The trick of a large-scale optimisation programme is to reduce the ratio of resource cost to opportunity, and to keep up the velocity of both testing throughput and learning, by completely removing wastage, stupidity and inefficiency from the process.
Running experiments on your site is a bit like running a busy airline at a major international airport: you have limited take-off slots and you need to make sure you use them effectively.
We’ll cover a lot of ground, including:
- What kinds of A/A test are there?;
- Why do people do them?;
- Why is this a problem?;
- The Dirty Secret of split testing;
- Triangulate your data;
- Watch the test like a Chef;
- Machine Learning & Summary.
What kinds of A/A tests are there?
A/A – It’s a 50/50 split
The most common setup here is just a 50/50 split in testing exactly the same thing. Yup. We run a test of the original page against itself. Why?
The idea here is to validate the test setup by seeing that you get roughly the same performance from each variant. You’re testing the same thing against itself, to see if there’s noise in the data, instead of signal.
In a coin flipping example, you’re checking that if you flip the coin enough times, heads and tails come out roughly equal. If the coin were weighted (like a magician’s special coin), running the exercise would reveal a noticeable bias.
Running an A/A is about validating the test setup. People basically use this to test the site to see if the numbers line up.
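To make the coin-flip idea concrete, here is a minimal Python sketch of an A/A test. It is an illustration only: the function names, the 5% conversion rate and the traffic numbers are assumptions, and the pooled two-proportion z-test stands in for whatever statistics your testing tool actually uses.

```python
import math
import random


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # nobody (or everybody) converted: nothing to compare
        return 0.0, 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return z, math.erfc(abs(z) / math.sqrt(2))


def simulate_aa(visitors_per_arm, true_rate, seed=None):
    """Serve the *same* page to two buckets and count conversions in each."""
    rng = random.Random(seed)
    conv_a = sum(rng.random() < true_rate for _ in range(visitors_per_arm))
    conv_b = sum(rng.random() < true_rate for _ in range(visitors_per_arm))
    return conv_a, conv_b


if __name__ == "__main__":
    n = 5000  # hypothetical visitors per bucket
    a, b = simulate_aa(n, true_rate=0.05, seed=42)
    z, p = two_proportion_z(a, n, b, n)
    print(f"A: {a / n:.2%}  B: {b / n:.2%}  z={z:.2f}  p={p:.3f}")
```

Even with identical pages the two buckets will rarely show exactly the same rate. The question an A/A test asks is whether the gap is bigger than chance alone would explain.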
The problem is that this takes time that would normally be used to run a split test. If you have a high traffic site, you might think this is a cool thing to do, but in my opinion you’re just using up valuable test insight time.
In my experience, it’s a lot quicker just to properly test your experiments before going live. It also gives you confidence in your test where A/A frippery may inject doubt.
What do I recommend then?
- Cross browser testing;
- Device testing;
- Friends & Family;
- Analytics Integration;
- Watch the test closely;
This approach is a lot quicker and has always worked best for me, rather than running A/A tests. Use triangulated data, obsessive monitoring and solid testing to pick up on instrumentation, flow or compatibility problems that will bias your results, instead of using A/A test cycles.
The big problem that people never seem to recognise is that flow, presentation, device or browser bugs are the most common form of bias in A/B testing.
A/A/B/B – 25% Splits
OK, what’s this one then? It looks just like an A/B test to me, except it isn’t. We’ve now split the traffic into four 25% samples, with A and B each appearing twice.
So what’s this supposed to solve? It’s to check the instrumentation again (like A/A) but also to confirm whether there are oddities in the outcomes. I get the A/A validation part (which I’ve covered already), but what about the results looking different between your two A samples or your two B samples?
What if they don’t line up perfectly? Who cares: you’re looking at the wrong thing anyway, the average.
Let’s imagine you have 20 people come to the site, and five of them end up in each of the four sample buckets. What if five of the 20 are repeat visitors and all end up in one bucket? Won’t that skew the results? Hell yes. But that’s why you should never look at small sample sizes for insight.
So what have people found using this? That the sample performance does indeed move around and especially so early in the test or if you have small numbers of conversions. I tend to not trust anything until I’ve hit 350 outcomes in a sample and at least two business cycles (e.g. weekly) as well as other factors.
The problem with using this method is you’ve split A and B into 4 buckets, so the effect of skew is more pronounced, your effective sample size is smaller and therefore the error rate (fuzziness) of each individual sample is higher.
Put simply, the chances that you’ll get skew are higher than if you’re just measuring one A and one B bucket. It also means that because your sample sizes are smaller, the error rate (the +/- stuff) will be higher on each measurement.
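The size of that +/- can be sketched with the usual normal approximation for a proportion. This is a rough illustration, not a substitute for your tool’s own intervals; the traffic figure and the 5% rate are made-up assumptions.

```python
import math


def margin_of_error(rate, visitors, z=1.96):
    """Approximate 95% margin of error (the +/-) on a conversion rate."""
    return z * math.sqrt(rate * (1 - rate) / visitors)


if __name__ == "__main__":
    total_visitors = 20_000  # hypothetical traffic for the whole test
    rate = 0.05              # hypothetical conversion rate
    for label, buckets in [("A/B", 2), ("A/A/B/B", 4)]:
        per_bucket = total_visitors // buckets
        moe = margin_of_error(rate, per_bucket)
        print(f"{label:8s} {per_bucket:6d} per bucket  +/- {moe:.2%}")
```

Halving the visitors in a bucket widens its margin of error by a factor of about 1.41 (the square root of 2), which is why each of the four buckets looks jumpier than the two buckets of a plain A/B.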
If you tried A/A/A/B/B/B, you’d just magnify the effect. The problem is knowing when the samples have stopped moving around. This is a numbers thing, but it’s also done a lot by feel for the movements in the test samples. The big prize is not about how test results fluctuate between identical samples (A/A/B/B); it’s about how visitor segments fluctuate (covered below).
A/B/A – A better way
Suggested by @danbarker this works to help identify instrumentation issues (like A/A) but without eating into as much test time.
This has the same problem as A/A/B/B in that the two A samples are smaller and therefore have higher error rates. Your reporting interface is also going to be more complex, as you now have 3 lines of numbers to crunch (or 4 in A/A/B/B).
You also have the issue that as the samples are smaller, it will take longer for the two A variants to settle than a straight A/B. Again, a trade-off of time versus validation – but not one I’d like to take.
If you really want to do this kind of test validation, I think Dan’s suggestion is the best one. I still think there is a bigger prize though—and that’s segmentation.
Why do people do A/A tests?
Sometimes it’s because it is seen as ‘statistics good practice’ or a ‘hallmark’ of doing testing properly.
It’s also seen as a clean way of running the test to have a dry run before the main event. For me, the cost of fixing the car whilst I’m driving it (running live tests) is far higher than when stationary in the garage (QA).
For me, getting problems out of ANY testing is the priority and A/A doesn’t catch test defects like QA work does. It might be worth running one if you’re bedding in some complex code that the developers can re-use on later tests. I just can’t recommend doing A/A for every test.
What’s the problem then?
The problem is that it always eats real traffic and test time, because you have to preload the test runtime with a period of A/A testing. If I’m trying to run 40 tests a month, this will cripple my ability to get stuff live. I’d rather have a half day of QA testing on the experiment than run 2-4 weeks of A/A testing to check it lines up.
The other problem is that nearly 80% of A/A tests will reach significance at some point. In other words, the test system will conclude that the original is better than the original with a high degree of confidence!
Why? Well, it’s a numbers and sampling thing but it’s also because you’re reading the test wrong. If you have small samples, it’s quite possible that you’ll conclude that something is broken when it’s not.
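The “reading the test wrong” part can be illustrated with a small simulation: run many A/A tests, peek at the significance figure at regular intervals, and count how often it crosses the line at any peek versus only at the end. This is a sketch under assumed numbers (5% conversion rate, 2,000 visitors per arm, a peek every 100 visitors, a z-test at the 95% level). It shows the mechanism and is not meant to reproduce the figure quoted above.

```python
import math
import random


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def run_aa_with_peeking(rng, rate, visitors_per_arm, peek_every, alpha=0.05):
    """One A/A test; returns (significant at any peek, significant at the end)."""
    conv_a = conv_b = 0
    ever_significant = significant_now = False
    for n in range(1, visitors_per_arm + 1):
        conv_a += rng.random() < rate  # both arms share the same true rate
        conv_b += rng.random() < rate
        if n % peek_every == 0 or n == visitors_per_arm:
            significant_now = p_value(conv_a, n, conv_b, n) < alpha
            ever_significant = ever_significant or significant_now
    return ever_significant, significant_now


def false_positive_rates(runs=200, rate=0.05, visitors_per_arm=2000,
                         peek_every=100, seed=1):
    """Share of A/A tests 'won' at any peek, and share 'won' at the final look."""
    rng = random.Random(seed)
    any_peek = at_end = 0
    for _ in range(runs):
        ever, final = run_aa_with_peeking(rng, rate, visitors_per_arm, peek_every)
        any_peek += ever
        at_end += final
    return any_peek / runs, at_end / runs


if __name__ == "__main__":
    peeking, final_only = false_positive_rates()
    print(f"significant at some peek: {peeking:.0%}")
    print(f"significant at the end:   {final_only:.0%}")
```

The more often you look, the more chances random noise has to cross the threshold once, so the “at some peek” share can only be equal to or higher than the “at the end” share.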
The other problem is—when you’re A/A testing—you’re comparing the performance of two things that are identical. The amount of sample and data you need to prove that there is no significant bias is huge by comparison with an A/B test.
How many people would you need in a blind taste testing of Coca-Cola (against Coca-Cola) to conclude that people liked both equally? 500 people, 5000 people?
And this is why we don’t test very similar things in split tests—detecting marginal gains is very hard and when you test identical things, this is even more pronounced. You could run an A/A test for several weeks LONGER than the A/B test itself and get no valuable insight, either on whether the test was broken or your ability to understand sampling <grin>
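A rough sample size calculation shows why near-identical variants are so expensive to compare. The sketch below uses the standard normal-approximation formula for two proportions, with an assumed 95% confidence level and 80% power; the conversion rates are hypothetical and your own tool may calculate this differently.

```python
import math
from statistics import NormalDist


def visitors_per_arm(base_rate, variant_rate, alpha=0.05, power=0.8):
    """Approximate visitors needed in each arm to detect the given difference."""
    if base_rate == variant_rate:
        raise ValueError("identical rates: no finite sample detects a zero difference")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = base_rate * (1 - base_rate) + variant_rate * (1 - variant_rate)
    needed = (z_alpha + z_power) ** 2 * variance / (base_rate - variant_rate) ** 2
    return math.ceil(needed)


if __name__ == "__main__":
    base = 0.05  # hypothetical control conversion rate
    for variant in (0.06, 0.055, 0.051):
        print(f"{base:.1%} vs {variant:.1%}: "
              f"{visitors_per_arm(base, variant):,} visitors per arm")
```

The required sample grows with the inverse square of the difference, so shrinking the gap by a factor of ten needs roughly a hundred times the traffic. With two identical pages the gap is zero, which is the Coca-Cola versus Coca-Cola problem in numbers.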
A good example here is that people who espouse A/A testing forget other sources of bias in running tests, such as the slower converter and the novelty effect.
If you run a test for two weeks and your average purchase cycle is four weeks, you’re going to cut off some visitors to the experiment when you close the test. This is why it’s important to know your purchase cycle time to conversion as you might run an A/B test that only captures ‘fast converters’.
Ton Wesseling always recommends (and I agree) that you leave an experiment running when you ‘close’ it to new visitors. That way, people who’re part way through converting can continue to see the experiment and convert after the end of the test. This is a way of letting the test participants flush through the system and add more sample, without showing it to new people.
If you’re optimising the end of the testing cycle, by understanding purchase cycles, isn’t there some sort of bias at the start of testing?
Well part of this is ‘Regression toward the mean’ which we see in tests for all sorts of things, and the second part is the novelty effect.
If James has been visiting the website for four weeks and is about to purchase, he’s been seeing the old product page for all that time. On his final visit before converting, he sees a brand new shiny product page that’s much better and is influenced to buy. His friend Bob, meanwhile, has been seeing the same page for four weeks and when he arrives, he still gets the old (control) version.
This means that the people newly entering the experiment also include ‘old’ visitors who are later in their lifecycle. This novelty spike can bias the data early in your test, at least until some of the cycles are flushed through the experiment. In theory, you ought to start cookieing all visitors a few weeks before the test, so that you can put only genuinely new visitors into your experiment, not those who might be hit by a novelty effect late in the purchase cycle.
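Both ideas, letting existing participants flush through after you close the test and admitting only genuinely new visitors at the start, come down to a couple of rules at the point where you decide who sees what. Here is a minimal sketch. The `Visitor` fields and function names are hypothetical, and it assumes a first-seen date is available from a cookie set before the test began.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Visitor:
    first_seen: date               # hypothetical: read from a long-lived cookie
    variant: Optional[str] = None  # set once the visitor has been bucketed


def may_enter(visitor, today, test_start, closed_to_new_on=None):
    """True if an unassigned visitor may be bucketed into the experiment."""
    if visitor.variant is not None:
        return False  # already a participant
    if today < test_start:
        return False  # test not started yet
    if closed_to_new_on is not None and today >= closed_to_new_on:
        return False  # closed to new entrants
    # Visitors first seen before the test may be late in their purchase
    # cycle and prone to the novelty effect, so keep them out.
    return visitor.first_seen >= test_start


def still_served(visitor):
    """Existing participants keep seeing their variant, even after closing."""
    return visitor.variant is not None


if __name__ == "__main__":
    start, closed = date(2024, 3, 1), date(2024, 3, 29)  # hypothetical dates
    old_hand = Visitor(first_seen=date(2024, 2, 10))
    newcomer = Visitor(first_seen=date(2024, 3, 5))
    print(may_enter(old_hand, date(2024, 3, 6), start, closed))  # False
    print(may_enter(newcomer, date(2024, 3, 6), start, closed))  # True
```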
My point in showing these two examples is that there are loads of sources of bias in our split testing. A/A testing might spot some big biases, but I find that it’s inefficient and doesn’t answer everything that QA, analytics integration and segmentation can.
The dirty secret of testing
Every business I’ve tested with has a different pattern, randomness or cycle to it—and that’s part of the fun. Watching and learning from the site and test data during live operation is one of the best parts for me. But there is a dirty secret in testing – that 15% lift you got in January? You might not have it any more!
Why? Well, you might have cut your PPC budget since then, driving fewer warm leads into your business. You might have run some TV ads that really put off people who previously responded well to your creative.
It might be performing much better than you thought. But you don’t actually know!
It’s the Schrödinger’s Cat of split testing: you don’t know whether it’s still driving the same lift unless you retest it. This is the problem with sequential rather than dynamic testing. You have to keep moving the needle up and you don’t know if a lift from an earlier experiment is still delivering.
You leave a stub running
To get around the fact that creative performance moves, I typically leave a stub running (say 5-10%) to keep tracking the old control (loser) against the new variant (winner) for a few weeks after the test finishes.
If the CFO shows me figures disputing the lift, I can show that performance is far higher than the old creative would have delivered if I had left it running. This has been very useful, at least when bedding in a new tool with someone who distrusts the lift until they ‘see it coming through’ the other end!
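One way to carve out a stable stub is to hash each visitor ID into a number between 0 and 1 and send the bottom slice to the old control. This is a sketch of a common technique, not a description of any particular testing tool; the salt, the 10% share and the names are assumptions.

```python
import hashlib


def bucket(visitor_id, salt="post-test-stub"):
    """Map a visitor ID to a stable number in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{visitor_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x1_0000_0000


def creative_for(visitor_id, stub_share=0.10):
    """Keep a small, stable share of visitors on the old control."""
    return "old_control" if bucket(visitor_id) < stub_share else "winner"


if __name__ == "__main__":
    ids = [f"visitor-{i}" for i in range(10_000)]  # hypothetical IDs
    share = sum(creative_for(v) == "old_control" for v in ids) / len(ids)
    print(f"share still seeing the old control: {share:.1%}")
```

Because the assignment depends only on the ID and the salt, a returning visitor keeps seeing the same creative for as long as the stub runs.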
However, if you’re just continually testing and improving—all this worry about the creative changes becomes rather academic—because you’re continually making incremental improvements or big leaps.
The problem is where people test something and then STOP – this is why there are some pages I worked on that are still under test 4 years later – there is still continual improvement to be wrought even after all that time.
Products like Google Content Experiments (built into Google Analytics) and Conductrics now offer the multi-armed bandit algorithm to get round this obvious difference between what the creative did back then vs. now (by adjusting the stuff shown to visitors as their behavioural response changes).
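For a feel of how a bandit shifts traffic, here is a minimal Thompson sampling sketch. It is one textbook bandit strategy chosen for illustration and makes no claim about how the products named above implement theirs; the conversion rates are made up.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate a Beta-Bernoulli bandit; returns how often each arm was shown."""
    rng = random.Random(seed)
    arms = range(len(true_rates))
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        # Draw a plausible conversion rate for each arm from its posterior,
        # then show the arm whose draw is highest.
        draws = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in arms]
        arm = draws.index(max(draws))
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return pulls


if __name__ == "__main__":
    # Hypothetical creatives converting at 5% and 20%.
    print(thompson_sampling([0.05, 0.20], rounds=3000, seed=7))
```

This basic version assumes each creative’s true rate stays fixed. Coping with responses that drift over time would need something extra, such as discounting older observations.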
I postulated back in 2006 that this was the kind of tool we needed – something that dynamically mines the web data, visit history, backend data and tracking into a personalised and dynamic split test serving system. Something that can look at all the customer attributes, targeting, advertising, recommendations, personalisation or split tests—and know what to show someone, at what time. Allowing this system to self-tune (with my orchestration and fresh inputs) looks like the future of testing to me:
[Reference article: Multi Armed bandits]
Triangulate your data
One thing that’s really helped me to avoid instrumentation and test running issues is to run at least two analytics sources. As a minimum, make full use of your split testing software’s ability to integrate with a second analytics package.
Doing so gives you two sources of performance data to triangulate or cross check against each other. If they don’t line up proportionally, or look biased to an analyst’s eye, you can pick up reporting issues before you’ve started your test. I’ve encountered plenty of issues with A/B testing packages not lining up with what the site analytics said, and it’s always been a developer and instrumentation issue. You simply can’t trust one set of experiment metrics; you need a backup to compare against, in case you’ve broken something.
Don’t weep later about lost data—just do your best to make sure it doesn’t happen. It also helps as a belt and braces monitoring system for when you start testing – again so you can keep watching and checking the data.
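“Lining up proportionally” can be checked mechanically. The two tools will rarely report identical totals, so the sketch below compares each variant’s share of the total in each source rather than the raw counts. The function name, the counts and the 5-point tolerance are all assumptions to tune for your own setup.

```python
def triangulate(tool_counts, analytics_counts, tolerance=0.05):
    """Return variants whose share of traffic differs between two sources."""
    tool_total = sum(tool_counts.values())
    analytics_total = sum(analytics_counts.values())
    if tool_total == 0 or analytics_total == 0:
        raise ValueError("one source recorded nothing: check the integration")
    mismatches = {}
    for variant, count in tool_counts.items():
        tool_share = count / tool_total
        analytics_share = analytics_counts.get(variant, 0) / analytics_total
        if abs(tool_share - analytics_share) > tolerance:
            mismatches[variant] = (round(tool_share, 3), round(analytics_share, 3))
    return mismatches


if __name__ == "__main__":
    testing_tool = {"A": 1000, "B": 1000}  # hypothetical visitor counts
    analytics = {"A": 900, "B": 600}       # B looks under-recorded here
    print(triangulate(testing_tool, analytics))
```

An empty result means the two sources agree on the split, even if their absolute numbers differ.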
[Reference article: How to Analyze Your A/B Test Results with Google Analytics]
Watch it like a chef
You need to approach every test like a labour intensive meal, prepared by a Chef. You need to be constantly looking, tasting, checking, stirring and rechecking things as it starts, cooks and gets ready to finish. This is a big insight that I got from watching lots of tests intensely—you get a better feel for what’s happening and what might be going wrong.
Sometimes I will look at a test hundreds of times a week – for no reason other than to get a feel for fluctuations, patterns or solidification of results. You have to resist the temptation to be drawn in by the pretty graphs during the early cycle of a test.
If you’re less than one business cycle (e.g. a week) into your test, ignore the results. If you have fewer than 350 outcomes in each sample (and certainly if you have fewer than 250), ignore the results. If the samples are still moving around a lot, ignore the results. It’s not cooked yet.
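Those three rules of thumb can be written down as a simple checklist. The one-week cycle and the 350 outcomes come from the text above; the function name and the `max_wobble` threshold for “still moving around” are hypothetical stand-ins for what is, in practice, partly a judgment call.

```python
def reasons_to_ignore(days_running, outcomes_per_sample, recent_lifts,
                      cycle_days=7, min_outcomes=350, max_wobble=0.02):
    """Return the reasons a test is 'not cooked yet' (empty list = readable)."""
    reasons = []
    if days_running < cycle_days:
        reasons.append("less than one business cycle")
    if min(outcomes_per_sample) < min_outcomes:
        reasons.append("too few outcomes in a sample")
    # 'Moving around' is approximated by the spread of recent lift readings.
    if len(recent_lifts) < 2 or max(recent_lifts) - min(recent_lifts) > max_wobble:
        reasons.append("samples still moving around")
    return reasons


if __name__ == "__main__":
    # Hypothetical test: day 3, 120 and 140 outcomes, lift readings jumping about.
    print(reasons_to_ignore(3, [120, 140], [0.10, -0.02, 0.06]))
```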
Anyone with solid test experience knows that your data and response are moving around constantly. All the random visitors coming into the site and seeing creatives are constantly changing the precision and nature of the data you see.
The big problem with site averages for testing is that you’re not looking inside the average—to the segments. A poorly performing experiment might have absolutely rocked—but just for returning visitors. Not looking at segments will mean you miss that insight.
Having a way to cross instrument your analytics tool (with a custom variable, say in GA) will allow you then to segment the creative level performance. One big warning here—if you split the sample up, you’ll get small segments.
If you have an A and a B creative, imagine them as two large yellow space hoppers, sitting above a tennis court. You are in the audience seating and you’re trying to measure how far they are apart. They aren’t solid spaces but are fuzzy – you can’t see precisely where the centre is – just a fuzzy indistinct area in space.
Now as your test runs, these space hoppers shrink, so you can be more confident about their location and, for example, their difference in height. As they get toward the size of a tennis ball, you’re much more confident about their precise location and can measure more precisely how far apart they are.
Be wary of small sample sizes
If you split your A and B results into segments, you hugely increase how fuzzy your data is. So be careful not to segment into tiny samples, and be careful about trusting what the data tells you at small numbers of conversions or outcomes.
Other than that, segmentation will tell you much more useful stuff about A/B split testing than any sample splitting activity—because it works at the level of known attributes about visitors, not just the fuzziness of numbers. When I get a test that fails, that should be of some insight to me but I always mine the segments to see what drove the average. That provides me with key insight about not only my hypothesis but how different groups reacted to my experiment.
And this is the most important bit—when you get a test that comes out ‘about the same’ as the original, guaranteed there is a segment level driver that will be of interest to you. The average test result might have come out ‘about the same’ but the segment level response likely contains useful insight for you.
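Here is a small, made-up example of an ‘about the same’ result hiding two opposite segment responses. The numbers are invented purely to show the arithmetic, and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical results: (segment, variant, visitors, conversions)
ROWS = [
    ("new", "A", 1000, 50),
    ("new", "B", 1000, 30),
    ("returning", "A", 500, 25),
    ("returning", "B", 500, 45),
]


def conversion_rates(rows):
    """Conversion rate per (segment, variant), plus an 'all' roll-up."""
    totals = defaultdict(lambda: [0, 0])
    for segment, variant, visitors, conversions in rows:
        for key in ((segment, variant), ("all", variant)):
            totals[key][0] += visitors
            totals[key][1] += conversions
    return {key: conv / vis for key, (vis, conv) in totals.items()}


if __name__ == "__main__":
    for (segment, variant), rate in sorted(conversion_rates(ROWS).items()):
        print(f"{segment:10s} {variant}: {rate:.1%}")
```

In this invented data both variants convert at 5% overall, yet B loses with new visitors (3% against 5%) and wins with returning visitors (9% against 5%). The segments here are small, so the earlier warning about fuzziness applies before acting on a split like this.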
Is the future of split testing in part automation? Yes—I think so. I think these tools will help me run more personalised and segment driven tests – rather than trying to raise the ‘average’ visitor performance. I also think they remove the need to have tools for personalisation, targeting, split and multi-variate testing – basically all experiments with ‘trying stuff on people’.
The tools will simply expand the area of experimentation I can cover, a bit like going from using a plough to having a tractor. I don’t think it reduces the need for human orchestration of tests; it just helps us do much more at scale than we could ever imagine doing manually.
What’s the best way to avoid problems? Watching the test like a hawk and opening up your segmentation will work wonders. QA time is also a wise investment as it beats the existential angst hands down, when you have to scrap that useless data four weeks later.
And thanks to many people for having the questions and insights that made me think about this stuff. A hat tip to @danbarker, @peeplaja, @mgershoff, @distilled and @timlb for refining my knowledge and prompting me to write this article.