How to Minimize A/B Test Validity Threats

You have an A/B testing tool, a well-researched hypothesis and a winning test with 95% confidence. The next step is to declare the winner and push it live, right?

Not so fast.

There are factors threatening the validity of your test, without you even realizing it. If they go unrecognized, you risk making decisions based on bad data.

What Are Validity Threats?

Validity threats are factors that can distort your A/B test results, making the conclusions you draw from them unreliable.

There are two types of errors that occur in statistics, Type I and Type II. Type I errors occur when you find a difference or correlation when one doesn’t exist. Type II errors occur when you find no difference or correlation when one does exist.

Validity threats make these errors more likely.

In order to understand validity threats, you must first understand the different types of validity. Of course, there are many, but the three most common (and relevant) types of validity for conversion optimization are: internal validity, external validity, and ecological validity.

If it can be proven that the cause comes before the effect and the two are related, it’s internally valid. If the results can be generalized beyond that individual test, it’s externally valid.

Often, internal and external validity work against each other. Efforts to make a test internally valid can limit your ability to generalize the results beyond that individual test.

Ecological validity looks at how applicable the results are in the real world. Many formal, laboratory-based tests are criticized for not being ecologically valid because they are conducted in artificially controlled environments and conditions.

Yeah, But That Doesn’t Affect Me, Right?

Of course it affects you. Validity threats affect anyone who runs a test, whether they realize it or not. Just because test results look conclusive doesn’t mean they are.

For example, Copy Hackers ran a test on their home page a couple of years ago. They decided to optimize this section of their page for “Increase Your Conversion Rate Today” clicks…

During the first two days, the results were very up and down. After just six days, their testing tool declared a winner with 95% confidence. Since it hadn’t been a full week, they let the test run for another day. After a week, the tool declared a 23.8% lift with 99.6% confidence.

Seems like a conclusive win, right? Well, it was actually an A/A test, meaning both the control and the treatment were exactly the same.
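
If you’re wondering how a testing tool can call a “winner” when nothing changed, it’s easy to see by simulation. Here’s a minimal sketch in Python (using NumPy and SciPy, with made-up traffic numbers) that runs many A/A tests and checks significance after every day. Stopping at the first 95%-confident result, the way the tool did above, produces “winners” far more often than the 5% you’d expect.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def aa_test_with_peeking(daily_visitors=500, true_rate=0.05, days=14, alpha=0.05):
    """Simulate one A/A test and 'peek' at significance after every day."""
    conv_a = conv_b = n_a = n_b = 0
    for _ in range(days):
        n_a += daily_visitors
        n_b += daily_visitors
        conv_a += rng.binomial(daily_visitors, true_rate)
        conv_b += rng.binomial(daily_visitors, true_rate)  # identical true rate: any "winner" is noise
        p_a, p_b = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        if se > 0 and 2 * (1 - norm.cdf(abs((p_b - p_a) / se))) < alpha:
            return True  # a naive tool would declare a "winner" here and stop
    return False

false_winners = sum(aa_test_with_peeking() for _ in range(2000))
print(f"A/A tests that declared a 'winner' by peeking daily: {false_winners / 2000:.1%}")
```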

Testing tools cannot account for all validity threats for you, which means you could be drawing insights and basing future tests on inconclusive or invalid results. Results like these, skewed by validity threats, can and do happen to you.

(You just don’t know it, yet.)

How Can I Eliminate Validity Threats?

The simple answer is that you can’t completely eliminate validity threats; they always have existed and always will. The goal isn’t to eliminate them entirely, it’s to manage and minimize them.

As Angie Schottmuller explains in detail, it’s important to recognize validity threats prior to testing…

Angie Schottmuller, Growth Marketing Expert:

“Minimizing data ‘pollutants’ to optimize integrity is the hard part. Brainstorming and reviewing a list of any technical and environmental factors/variables that could potentially corrupt test validity is done up-front.

Pro Tip: Do this with your campaign team (e.g. PPC, SEO, IT, brand, etc.) whenever possible to best inventory and understand test risks and gain all-around support for testing. The team will be less likely to implement a change (‘test threat’) if they were involved up front, and it provides more eyes to monitor for unexpected ‘pollutants’ like competitor promotions, direct mail campaigns, or even regional weather factors.

I try to ensure the team documents in detail all approved or unexpected changes during the test. (e.g. PPC campaign management for excluding/adding keywords or ad placement.) Quality and timely team communication regarding changes is imperative, since some updates might void test validity. It’s better to learn quick and adapt than to proceed ignorant and foolish.

Note: After inventorying test threat variables it is possible to conclude that a planned test would produce (invalid) inconclusive results. Concurrent campaigns, technical upgrades, holidays, or other significant events might present too much risk. It’s best to recognize this BEFORE testing. Minimally, leverage it as an opportunity to show the inherent risk in testing and the need for iterative tests and multiple approaches (quantitative and qualitative) to validate a hypothesis.”

There is always risk in A/B testing, so before you test, go through these steps:

  1. With the help of your entire team, inventory all of the threats.
  2. Make your entire team aware of the test so that they do not create additional threats.
  3. If the list of threats is too long, postpone the test.

Why waste the testing traffic when the risk of invalid results is so high? It just doesn’t make sense. Instead, wait until you can run the test and gather conclusive insights.

Remember, as an optimizer, your primary goal is discovering conclusive insights that can be applied to all areas of your business, not simply winning individual tests.

Angie Schottmuller, Growth Marketing Expert:

“‘With great [test significance] power comes great [validity risk] responsibility.’
– Benjamin Parker (Spider-Man’s uncle), adapted by @aschottmuller

Balance accordingly and keep focus on learning confident insights, not simply achieving a desired single-test result.”

8 Common Validity Threats Secretly Sabotaging Your A/B Tests

Validity threats are widespread and come in many different forms, making them difficult to recognize. There are, however, some common validity threats that you’re more likely to run into than others.

1. Revenue Tracking

Inserting the testing tool JavaScript snippet is easy. But if you run an eCommerce site, you also need to implement revenue tracking for the testing tool. This is where things often go wrong.

Leho Kraav, a Senior Conversion Analyst at CXL, had this to say about revenue tracking implementation:

Leho Kraav, CXL:

“The main mistakes I see are…

1. Not having multiple tools counting revenue / transactions and reporting results in a format that even your mother should be able to understand whether things add up. Any tool can potentially mess something up, so you need backup. Basic combination: sales backend, Google Analytics e-commerce, and Optimizely. Main problem: understandable, consistent cross-tool reporting.

2. Messing up tracking tag firing when using multiple payment gateways. Examples: gateways send people to different thank you pages where different tags fire, bad return from gateway IPN handling in the code, etc.

3. Depending on the complexity of the checkout process of your app, it’s possible to accidentally fire revenue tracking multiple times. Depending on the tool, this can artificially inflate your revenue numbers. Once again, using multiple tools is helpful to verify any one tool’s periodic revenue report.

4. Not paying attention to the specific format tools want the transaction value in. Some want cent values, some dollar values. Fortunately, this is easy to notice unless you’re completely not paying attention.”
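
One way to act on Leho’s first point is to reconcile revenue across tools on a schedule. The sketch below is only an illustration, not a drop-in script — the file names and column names are hypothetical stand-ins for your own exports — but it shows the idea: compare daily totals from the sales backend, Google Analytics, and the testing tool, and flag days that diverge by more than a small tolerance.

```python
import csv

TOLERANCE = 0.02  # flag days where any tool diverges from the backend by more than 2%

def load_daily_revenue(path, date_col, revenue_col):
    """Read a hypothetical CSV export into {date: total revenue}."""
    totals = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row[date_col]] = totals.get(row[date_col], 0.0) + float(row[revenue_col])
    return totals

# Hypothetical exports -- adjust paths and column names to your own tools.
backend = load_daily_revenue("backend_orders.csv", "date", "order_total")
analytics = load_daily_revenue("ga_ecommerce.csv", "date", "revenue")
testing_tool = load_daily_revenue("testing_tool_revenue.csv", "date", "revenue")

for day, truth in sorted(backend.items()):
    for name, source in (("Google Analytics", analytics), ("testing tool", testing_tool)):
        reported = source.get(day, 0.0)
        if truth and abs(reported - truth) / truth > TOLERANCE:
            print(f"{day}: {name} reports {reported:.2f} vs backend {truth:.2f} -- investigate")
```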

How to manage it…

2. Flicker Effect

The flicker effect is when your visitor briefly sees the control before the treatment loads. This can happen for various reasons, like…

How to manage it…

3. Browser / Device Compatibility

Your treatments keep losing and you can’t figure out why. Provided that you’re actually testing the right stuff, the most common reason for failing tests is crappy code (i.e. variations don’t display / work properly in all browser versions and devices).

If it doesn’t work right on the browser / device combination your visitor is using, it will affect the validity of the test (i.e. non-working variations will lose).

This happens a lot when using the visual editor. (Unless you’re just doing copy changes, don’t ever use the visual editor.) The generated code is often terrible, and you might be altering elements you don’t mean to.

You can’t launch a test without doing proper quality assurance (QA). You need someone to put in the time to conduct cross-browser and cross-device testing for each variation before starting the experiment.
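
If manual QA on every browser/device combination is more than your team can click through, you can at least automate the screenshot pass. Here’s one possible sketch using Playwright’s Python API; it assumes your testing tool lets you force a variation with a URL parameter (the `variation` parameter and the URL here are hypothetical), and it saves a full-page screenshot per engine, viewport, and variation for a human to review.

```python
from playwright.sync_api import sync_playwright

BASE_URL = "https://example.com/landing-page"   # hypothetical page under test
VARIATIONS = ["control", "treatment-1"]          # hypothetical variation identifiers
VIEWPORTS = {"desktop": (1366, 768), "mobile": (375, 667)}

with sync_playwright() as p:
    for engine_name in ("chromium", "firefox", "webkit"):
        browser = getattr(p, engine_name).launch()
        for variation in VARIATIONS:
            for label, (width, height) in VIEWPORTS.items():
                page = browser.new_page(viewport={"width": width, "height": height})
                # Assumes the testing tool can force a variation via a query parameter.
                page.goto(f"{BASE_URL}?variation={variation}")
                page.screenshot(path=f"qa_{engine_name}_{label}_{variation}.png", full_page=True)
                page.close()
        browser.close()
```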

For example…

How to manage it…

4. Sample Size

Essentially, reaching statistical significance alone doesn’t make your results valid; you must also continue the test until you’ve reached the necessary sample size. Peep has written on stopping A/B tests, which covers sample size in-depth. If you have a few minutes, I suggest reading it thoroughly.
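
To make “necessary sample size” concrete, you can estimate it up front from your baseline conversion rate and the smallest lift you care about detecting. Here’s a minimal sketch using the standard two-proportion formula (Python with SciPy; the baseline and lift figures are placeholders, not recommendations):

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_variant(baseline, lift, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-sided test of two proportions."""
    p1 = baseline
    p2 = baseline * (1 + lift)          # conversion rate you hope the treatment reaches
    z_alpha = norm.ppf(1 - alpha / 2)   # e.g. 1.96 for 95% confidence
    z_beta = norm.ppf(power)            # e.g. 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Placeholder numbers: 3% baseline conversion rate, hoping to detect a 20% relative lift.
print(sample_size_per_variant(baseline=0.03, lift=0.20))  # about 14,000 visitors per variant
```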

For example…

How to manage it…

5. Selection Bias

Not only does your sample need to be large enough, but it needs to be representative of your entire audience, as well. Earlier this year, I wrote an article on sample pollution, which covers sampling a representative audience in detail.

Let’s say you have a low traffic site, and in order to run the experiment within a reasonable amount of time, you decide to acquire paid traffic. So you start a PPC campaign to drive paid traffic to the site. The test runs its course and you get a winner – 30% uplift!

You stop the PPC campaign and implement the winner, but you don’t see the 30% improvement in your sales. What gives? That is selection bias in action.

You assumed that your PPC traffic behaves the same way as the rest of your traffic, but it likely doesn’t.

Similarly, if you run a test using only your most loyal visitors or only women or only people who make over $75,000 a year, you don’t have a representative sample. That means your results cannot be generalized.
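
Before trusting (or generalizing) a result, it’s worth checking whether your traffic segments actually behaved the same way during the test. The sketch below (Python with SciPy, using invented counts) runs a chi-square test comparing PPC visitors against everyone else; if the segments clearly convert differently, a lift measured on a PPC-heavy sample shouldn’t be projected onto your regular traffic.

```python
from scipy.stats import chi2_contingency

# Invented example counts from a finished test: [conversions, non-conversions] per segment.
segments = {
    "ppc":     [180, 2820],   # 6.0% conversion rate
    "organic": [150, 3850],   # 3.75% conversion rate
}

table = [segments["ppc"], segments["organic"]]
chi2, p_value, dof, expected = chi2_contingency(table)

if p_value < 0.05:
    print(f"Segments convert differently (p = {p_value:.4f}) -- don't generalize from one to the other.")
else:
    print(f"No clear difference between segments (p = {p_value:.4f}).")
```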

How to manage it…

6. Day of the Week & Time of Day

Your traffic behaves differently on a Saturday than it does on a Wednesday. Similarly, your traffic behaves differently at 1 p.m. than it does at 1 a.m.

For example…

On different days of the week and at different times of day, your visitors are in different environments and different states of mind. When the context changes, so does their behavior.

If you ran a test from Monday to Friday, for example, you would have inconclusive results, even if you reached significance. What’s true for weekday traffic cannot be generalized to weekend traffic with complete accuracy.
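
A simple guardrail is to calculate, before launch, how many full weeks the test will need given your required sample size and typical daily traffic, and then commit to that duration. A minimal sketch (Python; the traffic and sample size numbers are placeholders):

```python
from math import ceil

def test_duration_in_full_weeks(sample_size_per_variant, variants, avg_daily_visitors):
    """Round the required test length up to whole weeks so every weekday is represented equally."""
    total_needed = sample_size_per_variant * variants
    days_needed = total_needed / avg_daily_visitors
    return ceil(days_needed / 7)  # whole weeks only

# Placeholder numbers: 14,000 visitors per variant, 2 variants, ~3,000 eligible visitors per day.
weeks = test_duration_in_full_weeks(14_000, 2, 3_000)
print(f"Run the test for {weeks} full week(s) -- about {weeks * 7} days.")
```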

How to manage it…

7. Season, Setting & Weather

Other external and seemingly irrelevant factors that affect the validity of your test include: the season, the setting visitors are in when they visit your site, the weather, and even the media.

For example…

How to manage it…

8. Competitive & Internal Campaigns

As Angie mentioned above, your conversions could be increasing due to competitive and internal campaigns that are unrelated to your test.

For example…

How to manage it…

Conclusion

Many A/B tests are plagued by validity threats, but most can be managed and minimized.

The result? More accurate test results and more valuable insights.

Here’s the step-by-step process for managing the common validity threats…

  1. Always integrate your testing tool with Google Analytics to see if the revenue numbers (mostly) match up.
  2. Reduce the flicker to 0.0001 seconds so that the human eye won’t detect it: optimize your site for speed, try split URL testing instead, or remove your A/B testing tool script from Tag Manager.
  3. Always conduct quality assurance for every device and every operating system.
  4. Run tests separately for each type of device.
  5. Stop your test only when you have reached the predetermined sample size.
  6. Take data fishing into account and adjust your significance level to compensate for the number of tests you’re running (see the sketch after this list).
  7. Run tests for as long as necessary (i.e. until your sample size is reached) and not a moment longer.
  8. When testing, use a representative sample and don’t acquire traffic from “unusual” sources.
  9. Run tests in full-week increments so that you include data from every day of the week and every time of day.
  10. If you run a test during a holiday season, know that the data you collect is only relevant to that season.
  11. Look at your annual data and identify anomalies (in traffic and conversions). Account for this when running your tests.
  12. Be aware of pop culture and the media. Document major news stories as they happen.
  13. Talk to your team before you run a test. Are there any marketing campaigns running during that time? Take inventory before you run your tests.
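
For step 6, the simplest adjustment is a Bonferroni-style correction: divide your significance level by the number of comparisons you’re making, so that running several treatments at once doesn’t inflate your false positive rate. A minimal sketch (Python; the p-values are invented):

```python
def bonferroni_adjusted_alpha(alpha, number_of_comparisons):
    """Tighten the per-test significance level when evaluating several treatments at once."""
    return alpha / number_of_comparisons

# Placeholder scenario: one control vs. three treatments evaluated at the usual 5% level.
alpha = bonferroni_adjusted_alpha(0.05, number_of_comparisons=3)
p_values = {"treatment-1": 0.030, "treatment-2": 0.012, "treatment-3": 0.200}  # invented p-values

for name, p in p_values.items():
    verdict = "significant" if p < alpha else "not significant"
    print(f"{name}: p = {p:.3f} vs adjusted alpha = {alpha:.4f} -> {verdict}")
```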
