Now we need to test the issues to validate and learn. Pick a testing tool, and create treatments / alternative variations to test against the current page (control).
There’s no shortage of testing tools – there’s even one built into Google Analytics, completely free. I mostly use Optimizely, but there’s also VWO, Qubit, Adobe Target, Convert.com, and many others.
Testing is no joke – you have to test right. Bad testing is even worse than no testing at all: it can leave you confident that solutions A, B, and C work well when in reality they hurt your business. Poor A/B testing methodologies are costing online retailers up to $13bn a year in lost revenue, according to research from Qubit. Don’t take this lightly!
A *very typical* story: a business runs 100 A/B tests over the year, yet a year later its conversion rate is right where it was when it began. Why? Because they tested wrong. A massive waste of time, money, and human potential.
You need to make sure your sample size is big enough
In order to be confident that the results of your test are actually valid, you need to know how big of a sample size you need.
You need a minimum number of observations to reach the right statistical power. Using the number you get from a sample size calculator as a ballpark is perfectly valid, but the test may not end up as powerful as you had originally planned. The only real danger is stopping the test early after peeking at preliminary results. There’s no penalty for a larger sample size – it only takes more time.
As a very rough ballpark I typically recommend ignoring your test results until you have at least 350 conversions per variation (or more – depending on the needed sample size). But don’t make the mistake of thinking 350 is a magic number, it’s not. Always calculate the needed sample size ahead of time! Related reading: Stopping A/B Tests: How Many Conversions Do I Need?
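To make the “calculate the needed sample size ahead of time” step concrete, here is a minimal sketch of the standard two-proportion power-analysis formula using only Python’s standard library. The baseline conversion rate and minimum detectable effect below are illustrative, not recommendations:

```python
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, min_detectable_effect,
                              alpha=0.05, power=0.80):
    """Approximate visitors needed per variation for a two-sided
    two-proportion z-test (standard power-analysis formula)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_effect)  # relative lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)     # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)              # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return int(n) + 1  # round up

# Illustrative numbers: 3% baseline conversion, hoping to detect a 20% lift
print(sample_size_per_variation(0.03, 0.20))  # ~14,000 visitors per variation
```

Note how quickly the requirement grows when the effect you want to detect shrinks – halving the detectable lift roughly quadruples the needed sample, which is why “just run it for a few days” so rarely works.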
If you want to analyze your test results across segments, you need even more conversions. It’s a good idea to run tests targeting a specific segment, e.g. separate tests for desktop, tablet, and mobile traffic.
Once your test has a large enough sample, you want to see whether one or more variations beats the control. For this we look at statistical significance.
Statistical significance (also called statistical confidence) is the probability that a test result is accurate and not due to just chance alone. Noah from 37Signals said it well:
“Running an A/B test without thinking about statistical confidence is worse than not running a test at all—it gives you false confidence that you know what works for your site, when the truth is that you don’t know any better than if you hadn’t run the test.”
Most researchers require a 95% confidence level before drawing any conclusions. At a 95% confidence level, the likelihood of the result being random is very small. Basically we’re saying “this change is not a fluke or caused by chance; it probably happened due to the changes we made”.
When an A/B testing dashboard (in Optimizely or a similar tool) says there is a “95% chance of beating original”, it is answering the following question: assuming there is no underlying difference between A and B, how often would we see a difference like the one in our data just by chance? The answer to that question is the p-value, and results are “statistically significant” when the p-value falls below a chosen significance level, e.g. 5% or 1%. Dashboards usually take the complement of this (e.g. 95% or 99%) and report it as a “chance of beating the original” or something like that.
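The computation behind that dashboard number varies by tool, but a common frequentist version is the pooled two-proportion z-test. Here is a minimal sketch using the standard library; the conversion counts are made up for illustration:

```python
from math import sqrt
from statistics import NormalDist

def significance(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test. Returns the p-value and its
    complement (the 'chance of beating original' style figure)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # two-sided
    return p_value, 1 - p_value

# Illustrative: control converts 300/10,000, variation converts 360/10,000
p_value, confidence = significance(300, 10000, 360, 10000)
print(round(p_value, 3), round(confidence, 3))
```

With these made-up numbers the p-value lands below 0.05, so the dashboard would flag the variation as a significant winner – but as the next point explains, that alone is not a reason to stop the test.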
If the results are not statistically significant, they may be due to random factors, and there may be no relationship between the changes you made and the test results (the assumption of no relationship is called the null hypothesis).
But don’t confuse statistical significance with validity.
Once your testing tool says you’ve achieved 95% statistical significance (or higher), that doesn’t mean anything if your sample size is too small. Reaching significance is not a stopping rule for a test.
Read this blog post to learn why. It’s very, very important.
Duration of the test
For some high-traffic sites you could reach the needed sample size in a day or two. But that is not a representative sample – it doesn’t cover a full business cycle: all weekdays, weekends, the phase of the moon, your various traffic sources, your blog publishing and email newsletter schedule, and every other variable.
So for a valid test both conditions should be met – an adequate sample size and enough time to include all of those factors (a full business cycle, better yet two). For most businesses this means 2 to 4 weeks.
No substitution for experience
Start running tests now.
There’s quite a bit to know about all this, but the content above will already make you smarter about running tests than most marketers.