There’s a philosophical statistics debate in the A/B testing world: Bayesian vs. Frequentist.
This is not a new debate. Thomas Bayes wrote “An Essay towards solving a Problem in the Doctrine of Chances” in 1763, and it’s been an academic argument ever since.
The issue is increasingly relevant in the CRO world—some tools use Bayesian approaches; others rely on Frequentist. When it comes to running your next test, what the hell does it all mean?
Note: I’m not going to wade too far into the philosophical debate between the two approaches or the intricacies of Bayes’ Theorem. I’ve listed some further reading at the bottom of each section if you’re interested in learning more.
The quick-and-dirty difference between Frequentist and Bayesian statistics
What are Frequentist statistics?
The Frequentist approach to statistics (and testing) is a method that draws conclusions about the underlying truth of an experiment using only data from that experiment.
You’re probably familiar with this approach to testing. It’s the model of statistics taught in most core-requirement college classes, and it’s the approach most often used by A/B testing software.
As Leonid Pekelis wrote in an Optimizely article,
Frequentist arguments are more counter-factual in nature, and resemble the type of logic that lawyers use in court. Most of us learn frequentist statistics in entry-level statistics courses. A t-test, where we ask, “Is this variation different from the control?” is a basic building block of this approach.
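To make the “Is this variation different from the control?” question concrete, here is a minimal sketch. The quote mentions a t-test; for conversion counts I've swapped in its close cousin, the two-proportion z-test. The function name and the visitor counts are hypothetical, made up for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variations convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, not taken from any real test.
z, p_value = two_proportion_z_test(conv_a=120, n_a=2400, conv_b=150, n_b=2400)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

Note that the only inputs are the counts from the current experiment. Nothing about past tests enters the calculation.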
What are Bayesian statistics?
The Bayesian approach to statistics is a method that encodes past knowledge of similar experiments into a statistical device known as a prior. This prior is combined with data from the current experiment to draw a conclusion about the test.
So, the biggest distinction is that Bayesian probability specifies that there is some prior probability.
The Bayesian approach goes something like this (summarized from this discussion):
- Define the prior distribution that incorporates your subjective beliefs about a parameter. The prior can be uninformative or informative.
- Gather data.
- Update your prior distribution with the data using Bayes’ theorem (though you can have Bayesian methods without explicit use of Bayes’ rule—see non-parametric Bayesian) to obtain a posterior distribution. The posterior distribution is a probability distribution that represents your updated beliefs about the parameter after having seen the data.
- Analyze the posterior distribution and summarize it (mean, median, sd, quantiles…).
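The four steps above can be sketched in a few lines. This is one simple way to do it, assuming a single conversion rate as the parameter and a grid of candidate rates instead of a closed-form solution. The function name `grid_posterior`, the grid resolution, and the data (30 conversions from 1,000 visitors) are all hypothetical.

```python
import math

def grid_posterior(prior, grid, conversions, visitors):
    """Apply Bayes' theorem on a discrete grid of candidate conversion rates."""
    # Log-likelihood of the data at each candidate rate (binomial kernel).
    log_like = [
        conversions * math.log(p) + (visitors - conversions) * math.log(1 - p)
        for p in grid
    ]
    peak = max(log_like)  # subtract the peak to avoid numerical underflow
    unnormalized = [w * math.exp(ll - peak) for w, ll in zip(prior, log_like)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Step 1: define a prior. Here it is uninformative: every rate equally likely.
grid = [i / 1000 for i in range(1, 1000)]
flat_prior = [1 / len(grid)] * len(grid)

# Step 2: gather data (hypothetical numbers).
conversions, visitors = 30, 1000

# Step 3: update the prior with the data to get the posterior.
posterior = grid_posterior(flat_prior, grid, conversions, visitors)

# Step 4: summarize the posterior.
mean = sum(p * w for p, w in zip(grid, posterior))
sd = math.sqrt(sum(w * (p - mean) ** 2 for p, w in zip(grid, posterior)))
cumulative, median = 0.0, grid[-1]
for p, w in zip(grid, posterior):
    cumulative += w
    if cumulative >= 0.5:
        median = p
        break

print(f"mean = {mean:.4f}, median = {median:.3f}, sd = {sd:.4f}")
```

Swapping `flat_prior` for a list of weights that favors the rates you saw in past tests is all it takes to make the prior informative.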
To explain Bayes’ reasoning in relation to conversion rates, Chris Stucchio gives the example of a hypothetical startup, BeerBnB. Its initial marketing efforts (ads in bar bathrooms) drew 794 unique visitors, 12 of whom created an account, giving the effort a 1.5% conversion rate.
Suppose the company could reach 10,000 visitors via toilet ads around the city. How many people should you expect to sign up? About 150 (1.5% of 10,000).
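“About 150” is a point estimate, and 12 signups is not much data. Here's a rough sketch of how uncertain that number is, using the BeerBnB figures above. The flat Beta(1, 1) prior, the number of draws, and the seed are my assumptions, not part of Stucchio's example, and the range only reflects uncertainty about the conversion rate itself.

```python
import random

VISITORS, SIGNUPS = 794, 12  # figures from the BeerBnB example
REACH = 10_000               # visitors the toilet ads could reach

# Naive point estimate: observed rate times reach.
point_estimate = SIGNUPS / VISITORS * REACH

# Assumption: a flat Beta(1, 1) prior, so the posterior for the rate is
# Beta(1 + signups, 1 + non-signups).
rng = random.Random(42)
draws = sorted(
    rng.betavariate(1 + SIGNUPS, 1 + VISITORS - SIGNUPS) * REACH
    for _ in range(20_000)
)
low = draws[int(0.025 * len(draws))]
high = draws[int(0.975 * len(draws))]

print(f"point estimate: {point_estimate:.0f} signups")
print(f"95% credible range: {low:.0f} to {high:.0f} signups")
```

The range is wide because 12 conversions leave a lot of room for the true rate to be higher or lower than 1.5%.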
Another example comes from Lean Analytics, which includes a case study about a restaurant, Solare. The staff know that if there are 50 reservations by 5 p.m., they can expect around 250 covers for the night. This is a prior, and it can be updated with new sets of data.
Or, as Boundless Rationality wrote,
A fundamental aspect of Bayesian inference is updating your beliefs in light of new evidence. Essentially, you start out with a prior belief and then update it in light of new evidence. An important aspect of this prior belief is your degree of confidence in it.
So why the controversy?
It’s much easier to debate minute tasks and equations than it is to discuss the testing discipline and the role of optimization in an organization.
Dr. Rob Balon, CEO of The Benchmark Company, agrees:
That said, the argument may not be entirely academic. In a New York Times article, Andrew Gelman defended Bayesian methods as a sort of double-check on spurious results.
As an example, he re-evaluated a study using Bayesian statistics. The study had concluded that women who were ovulating were 20% more likely to vote for President Obama in 2012 than those who were not.
Gelman added data showing that people rarely change their voting preference during an election cycle, even during a menstrual cycle. With this information added, the study’s statistical significance disappeared.
In my research, it’s clear that there’s a large divide based on the philosophy of each approach. In essence, they tackle the same problems in slightly different ways.
- An Intuitive Explanation of Bayes’ Theorem (amazing resource!)
- An Intuitive (and Short) Explanation of Bayes’ Theorem (abridged version of the above)
- A Technical Explanation of Technical Explanation (a more advanced reading)
- Bayesian or Frequentist, Which Are You? (video lecture)
- A List of Data Science and Machine Learning Resources
- Tutorials on Bayesian Nonparametrics
What does this have to do with A/B testing?
“A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.”
Though Balon referred to the debate as mostly “esoteric tail wagging” and Gershoff used the term “statistical theater,” there are business implications when it comes to A/B testing. Everyone wants faster and more accurate results that are easier to understand and communicate, and that’s what both methods attempt to do.
Though, as Gershoff explains: “Often—and I think this is a massive hole in the CRO thinking—is that we are trying to estimate the parameters for a given model (think targeting) in some rational way.”
Does it matter which you use?
Some say yes, and some say no. Like almost everything, the answer is complicated and has proponents on both sides. Let’s start with the pro-Bayesian argument.
In defense of Bayesian decisions
Lyst actually wrote an article last year about using Bayesian decisions. According to them, “We think this helps us avoid some common pitfalls of statistical testing and makes our analysis easier to understand and communicate to non-technical audiences.”
They say they prefer Bayesian methods for two reasons:
- Their end result is a probability distribution, rather than a point estimate. “Instead of having to think in terms of p-values, we can think directly in terms of the distribution of possible effects of our treatment. This makes it much easier to understand and communicate the results of the analysis.”
- Using an informative prior allows them to alleviate many of the issues that plague classical significance testing. (They cite repeated testing and a low base-rate problem—though Evan Miller disputed the latter argument on this Hacker News thread.)
They also offered the following visuals in which they drew two samples from a Bernoulli distribution (yes/no, tails/heads), computed the p parameter (probability of heads) estimates for each sample, and then took their difference:
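The experiment described above is easy to approximate yourself. This is my own sketch, not Lyst's code: it assumes both samples come from the same Bernoulli distribution (so any difference is pure noise), and the true rate, sample sizes, and repeat count are hypothetical.

```python
import random
import statistics

def estimate_difference(true_p, n, rng):
    """Draw two Bernoulli samples of size n and return the gap between their estimated p."""
    p_hat_a = sum(rng.random() < true_p for _ in range(n)) / n
    p_hat_b = sum(rng.random() < true_p for _ in range(n)) / n
    return p_hat_b - p_hat_a

rng = random.Random(7)
spreads = {}
for n in (100, 1000, 10000):
    # Repeat the experiment to see how far apart two identical "variations" can look.
    diffs = [estimate_difference(0.05, n, rng) for _ in range(100)]
    spreads[n] = statistics.pstdev(diffs)
    print(f"n = {n:>5}: typical difference = {spreads[n]:.4f}")
```

With small samples, two identical variations can look quite different, which is the kind of noise an informative prior is meant to temper.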
The article is a solid argument in favor of using a Bayesian method (they have a calculator you can use, too), but there is a caveat:
The advantages described above are entirely due to using an informative prior. If instead we used a flat (or uninformative) prior—where every possible value of our parameters is equally likely—all the problems would come back.
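To see why the prior carries so much weight, here is a sketch comparing a flat prior with an informative one on the same early data. It is not Lyst's implementation. The function name, the counts (3 of 50 vs. 8 of 50), and the informative prior (Beta(50, 950), roughly “about 5%, worth 1,000 past visitors”) are assumptions I chose for illustration.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior_alpha, prior_beta,
                   draws=20_000, seed=1):
    """Monte Carlo estimate of P(rate_B > rate_A) under a shared Beta prior."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each variation: prior counts plus observed counts.
        rate_a = rng.betavariate(prior_alpha + conv_a, prior_beta + n_a - conv_a)
        rate_b = rng.betavariate(prior_alpha + conv_b, prior_beta + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical early results: 3/50 conversions for A, 8/50 for B.
flat = prob_b_beats_a(3, 50, 8, 50, prior_alpha=1, prior_beta=1)
informative = prob_b_beats_a(3, 50, 8, 50, prior_alpha=50, prior_beta=950)

print(f"flat prior:        P(B > A) = {flat:.2f}")
print(f"informative prior: P(B > A) = {informative:.2f}")
```

Under the flat prior, a handful of conversions is enough to make B look like a near-certain winner. The informative prior pulls both rates toward past experience and keeps the early verdict far more cautious.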
Chris Stucchio explains some of the reasons that, several years ago, VWO switched to Bayesian decisions:
But some disagree…
So there’s a good amount of support for Bayesian methods. While there aren’t many anti-Bayesians, there are a few Frequentists as well as people who, generally, think there are more important things to worry about.
Anderson, for instance, says that for 99% of users, it doesn’t really matter.
Balon agrees, contending that the Bayesian vs. Frequentist argument is really not that relevant to A/B testing:
- Bayesian A/B Testing by Evan Miller
- Hacker News discussion on Bayesian A/B Testing
- Probabilistic Programming & Bayesian Methods for Hackers
- Easy Evaluation of Decision Rules in Bayesian A/B testing
Tools and methods
Most tools use Frequentist methods, though, as mentioned above, VWO uses Bayesian decisions. Optimizely’s Stats Engine is based on Wald’s sequential test. This is the sequential version of the Neyman-Pearson hypothesis testing approach, so it is a Frequentist approach (with flavors of Bayes).
Conductrics blends ideas from empirical Bayes with targeting to improve the efficiency of its Reinforcement Learning engine.
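For a feel of what a sequential test does, here is a textbook sketch of Wald's sequential probability ratio test for a single conversion rate. To be clear, this is not Optimizely's implementation, which I haven't seen. The function name, the two candidate rates, and the error levels are assumptions for illustration.

```python
import math

def sprt(observations, p0, p1, alpha=0.05, beta=0.2):
    """Wald's SPRT for Bernoulli data: H0 says the rate is p0, H1 says it is p1."""
    upper = math.log((1 - beta) / alpha)  # crossing this boundary favors H1
    lower = math.log(beta / (1 - alpha))  # crossing this boundary favors H0
    llr = 0.0  # running log-likelihood ratio of H1 vs. H0
    for i, converted in enumerate(observations, start=1):
        if converted:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        # Check the stopping boundaries after every single visitor.
        if llr >= upper:
            return "accept H1", i
        if llr <= lower:
            return "accept H0", i
    return "continue", len(observations)

# Hypothetical stream in which nobody converts: the test stops early for H0.
decision, n_used = sprt([0] * 100, p0=0.05, p1=0.10)
print(decision, "after", n_used, "visitors")
```

The point of the design is that you can look at the data after every visitor and stop as soon as a boundary is crossed, without the repeated-peeking problem of a fixed-sample test.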
Anderson doesn’t think we should spend much time worrying about the methods behind each tool. As he said about tools that advertise different methods as features:
Though you could dig forever and find strong arguments for and against each side, it comes down to this: We’re solving the same problem in two ways.
I like the analogy that Optimizely gave using bridges:
Just like suspension and arch bridges both successfully get cars across a gap, both Bayesian and Frequentist statistical methods provide an answer to the question: which variation performed best in an A/B test?
Anderson also had a fun way of looking at it:
In many cases this debate is the same as arguing the style of the screen door on a submarine. It’s a fun argument that will change how things look, but the very act of having it means that you are drowning.
Finally, you can mess up using either method while testing. As the Times article said, “Bayesian statistics, in short, can’t save us from bad science.”