
Three Hard Truths About A/B Testing

Sometimes A/B testing is made to seem like some magical tool that will fix all your problems at once. Conversions low? Just run a test and increase your conversions by 12433%! It’s easy!

Setting up and running split tests is indeed easy (if you’re using the right tools), but doing it right requires thought and care.

1. Most A/B tests won’t produce huge gains (and that’s okay)

I’ve read the same A/B testing case studies as you have. Probably more. One huge gain after another – or so it seems. The truth is that the vast majority of tests are never published. Just as most people who try to make it in Hollywood are people you’ll never hear about, you never hear about most A/B tests.

Most split tests “fail” in the sense that they don’t result in a lift in conversions (the new variations either produce no change or perform worse). AppSumo founder Noah Kagan has said this about their experience:

Only 1 out of 8 A/B tests have driven significant change.

It’s a good expectation to have. If you’re expecting every test to be a home run, you’re setting yourself up for some unhappy times.

Think of it as a process of continuous improvement

Conversion optimization is a process. And improving conversions is like getting better at anything – you have to do it again and again. It often takes many tests to gain valuable insights about what works and what doesn’t. Every product and audience is different, and even the best research will only take us so far. Ultimately we need to test our hypothesis in the real world and gain new insights from the test results.

It’s about incremental gains

A realistic expectation is that you’ll achieve a 10% gain here and a 7% gain there. In the end, all of these improvements add up.
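As a quick back-of-the-envelope illustration (the individual lift figures below are made up, not from any real test), successive wins compound rather than merely add up:

```python
# Rough sketch: how a series of modest wins compounds over time.
# The lift percentages below are made-up examples, not real test results.
lifts = [0.10, 0.07, 0.05, 0.08]  # +10%, +7%, +5%, +8% from separate tests

cumulative = 1.0
for lift in lifts:
    cumulative *= 1 + lift

# Prints "Cumulative lift: 33.5%" – more than the 30% you'd get by simply summing.
print(f"Cumulative lift: {cumulative - 1:.1%}")
```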


“Failed” experiments are for learning

I’ve had a ton of cases where I came up with a killer hypothesis, rewrote the copy and made the page much more awesome – only to see it perform WORSE than the control. You’ve probably experienced the same.

Unless you missed some critical insights in the process, you can usually turn “failed” experiments into wins.

I have not failed 10,000 times. I have successfully found 10,000 ways that will not work.

– Thomas Alva Edison, inventor

The real goal of A/B tests is not a lift in conversions (that’s a nice side effect), but learning something about your target audience. You can take those insights about your users and apply them across your marketing efforts – PPC ads, email subject lines, sales copy and so on.

Whenever you test variations against the control, you need a hypothesis as to what might work. Then, when you see variations win or lose, you can identify which elements really make a difference.

When a test fails, you need to figure out why: analyze what exactly didn’t work and feed that insight into your next hypothesis.

Here’s a case study on how a losing variation was turned around by analyzing what exactly wasn’t working on it.

2. There’s a lot of waiting (until statistical confidence)

A friend of mine was split testing his new landing page. He kept emailing me his results and findings. I was happy he was running so many tests, but he started to have “results” way too often. At one point I asked him, “How long do you run a test for?” His answer: “until one of the variations seems to be winning”.

Wrong answer. If you end the test too soon, there’s a high chance you’ll actually get wrong results. You can’t jump to conclusions before you reach statistical confidence.

Statistical significance is everything

Statistical confidence is the probability that a test result is accurate. Noah from 37Signals said it well:

Running an A/B test without thinking about statistical confidence is worse than not running a test at all—it gives you false confidence that you know what works for your site, when the truth is that you don’t know any better than if you hadn’t run the test.

Most researchers use a 95% confidence level before drawing any conclusions. At a 95% confidence level, the likelihood of the result being random is very small. Basically, we’re saying “this change is not a fluke or caused by chance; it probably happened due to the changes we made”.

If the results are not statistically significant, they might be caused by random factors, and there may be no relationship between the changes you made and the test results (this is called the null hypothesis).

Calculating statistical confidence is too complex for most, so I recommend you use a tool for this.
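That said, if you’re curious what such a tool computes under the hood, here’s a minimal sketch of a common approach – a two-sided, two-proportion z-test. The visitor and conversion counts are hypothetical:

```python
# Minimal sketch of an A/B significance check: a two-sided two-proportion z-test.
# Visitor and conversion counts below are hypothetical.
from math import sqrt
from scipy.stats import norm

visitors_a, conversions_a = 5000, 400   # control
visitors_b, conversions_b = 5000, 460   # variation

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b

# Pooled conversion rate under the null hypothesis (no real difference)
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))

z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value

print(f"Control: {p_a:.2%}, Variation: {p_b:.2%}")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("Significant at 95% confidence" if p_value < 0.05 else "Not significant yet")
```

A p-value below 0.05 is what testing tools usually report as “95% (or higher) statistical confidence”.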

Beware of small sample sizes

I started a test for a client. 2 days in, these were the results:

The variation I built was losing badly – by more than 89%. Some tools would already call it and say statistical significance was 100%. The software I used said Variation 1 had a 0% chance of beating the Control. My client was ready to call it quits.

However, since the sample size was still too small (only a little over 100 visits per variation), I persisted, and this is what it looked like 10 days later:

That’s right – the variation that had a 0% chance of beating the control was now winning with 95% confidence.

Don’t draw conclusions based on a very small sample size. A good ballpark is to aim for at least 100 conversions per variation before looking at statistical confidence (although a smaller sample might be just fine in some cases). Naturally, there’s a proper statistical way to determine the needed sample size, but unless you’re a data geek, use this tool (it will report statistical confidence as N/A if the proper sample size hasn’t been reached).
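For the data geeks, here’s a sketch of that “proper statistical way” as it’s commonly done for two proportions – a normal-approximation power calculation. The baseline rate and minimum detectable effect are assumed values for illustration:

```python
# Sketch: visitors needed per variation for a two-proportion test,
# using the standard normal-approximation power formula. Inputs are assumptions.
from math import ceil, sqrt
from scipy.stats import norm

baseline = 0.05            # current conversion rate (assumed)
mde = 0.20                 # minimum detectable effect: a 20% relative lift
alpha, power = 0.05, 0.80  # 95% confidence, 80% power

p1 = baseline
p2 = baseline * (1 + mde)

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

p_bar = (p1 + p2) / 2
n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar)) +
      z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p2 - p1) ** 2

print(f"Visitors needed per variation: {ceil(n)}")  # roughly 8,000+ for these inputs
```

Note how quickly the number grows: the smaller the lift you want to be able to detect, the more visitors you need.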

Watch out for A/B testing tools “calling it early”, and always double-check the numbers. Recently, Joanna from Copy Hackers posted about her experience with a tool declaring a winner too soon. Always pay attention to the margin of error and sample size.
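One quick way to sanity-check a tool’s numbers is to look at the margin of error on each conversion rate yourself. A minimal sketch using the normal approximation, with hypothetical early-days numbers:

```python
# Sketch: 95% margin of error for a conversion rate (normal approximation).
# The visitor and conversion counts are hypothetical.
from math import sqrt

visitors, conversions = 120, 6   # a small, early-days sample
rate = conversions / visitors

margin = 1.96 * sqrt(rate * (1 - rate) / visitors)
print(f"Conversion rate: {rate:.1%} +/- {margin:.1%}")
# With ~120 visitors the interval spans roughly 1% to 9% – far too wide to call a winner.
```

If the intervals of the control and the variation still overlap heavily, it’s too early to call the test.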

Patience, my young friend

Don’t be discouraged by the sample sizes required – unless you have a very high-traffic website, it’s always going to take longer than you’d like. It’s better to test something slowly than to test nothing at all. Every day without an active test is a day wasted.

3. Trickery doesn’t provide serious lifts, understanding the user does

I liked this tweet by Naomi Niles:

I couldn’t agree more. This kind of narrative gives people the wrong idea about what testing is about. Yes, sure – sometimes the color affects results, especially when it changes the visual hierarchy or makes the call to action stand out better. But “green vs. orange” is not the essence of A/B testing. It’s about understanding the target audience. Doing research and analysis can be tedious, and it’s definitely hard work, but it’s work you need to do.


In order to give your conversions a serious lift you need to do conversion research. You need to do the heavy lifting.

Serious gains in conversions don’t come from psychological trickery, but from analyzing what your customers really need, the language that resonates with them and how they want to buy. It’s about relevancy and the perceived value of the total offer.

Conclusion

1. Have realistic expectations about tests.

2. Patience, young grasshopper.

3. A/B testing is about learning. True lifts in conversions come from understanding the user and serving relevant and valuable offers.



