When should you use bandit tests, and when is A/B/n testing best?
Though there are some strong proponents (and opponents) of bandit testing, there are certain use cases where bandit testing may be optimal. Question is, when?
First, let’s dive into bandit testing and talk a bit about the history of the N-armed bandit problem.
The multi-armed bandit problem
The multi-armed bandit problem is a classic thought experiment. Imagine this scenario:
You’re in a casino. There are many different slot machines (known as ‘one-armed bandits,’ as they’re known for robbing you), each with a lever (and arm, if you will). You think that some slot machines payout more frequently than others do, so you’d like to maximize this.
You only have a limited amount of resources—if you pull one arm, then you’re not pulling another arm. Of course, the goal is to walk out of the casino with the most money. Question is, how do you learn which slot machine is the best and get the most money in the shortest amount of time?
If you knew which lever would pay out the most, you would just pull that lever all day. In regards to optimization, the applications of this problem are obvious. As Andrew Anderson said in an Adobe article:
The practical differences between A/B testing and bandit testing
A/B split testing is the current default for optimization, and you know what it looks like:
You send 50% of your traffic to the control and 50% of your traffic to variation, run the test ‘til it’s valid, and then decide whether to implement the winning variation.
In statistical terms, A/B testing consists of a short period of pure exploration, where you’re randomly assigning equal numbers of users to Version A and Version B. It then jumps into a long period of pure exploitation, where you send 100% of your users to the more successful version of your site.
In Bandit Algorithms for Website Optimization, the author outlines two problems with this:
- It jumps discretely from exploration to exploitation, when you might be able to transition more smoothly.
- During the exploratory phase (the test), it wastes resources exploring inferior options in order to gather as much data as possible.
In essence, the difference between bandit testing and a/b/n testing is how they deal with the explore-exploit dilemma.
As I mentioned, A/B testing explores first then exploits (keeps only winner).
Bandit testing tries to solve the explore-exploit problem in a different way. Instead of two distinct periods of pure exploration and pure exploitation, bandit tests are adaptive, and simultaneously include exploration and exploitation.
So, bandit algorithms try to minimize opportunity costs and minimize regret (the difference between your actual payoff and the payoff you would have collected had you played the optimal—best—options at every opportunity). Matt Gershoff from Conductrics wrote a great blog post discussing bandits. Here’s what he had to say:
The benefits of bandits
The first question to answer, before answering when to use bandit tests, is why to use bandit tests. What are the advantages?
They’re more efficient because they move traffic towards winning variations gradually, instead of forcing you to wait for a “final answer” at the end of an experiment. They’re faster because samples that would have gone to obviously inferior variations can be assigned to potential winners. The extra data collected on the high-performing variations can help separate the “good” arms from the “best” ones more quickly.
- Earn while you learn. Data collection is a cost, and bandit approach at least lets us consider these costs while running optimization projects.
- Automation. Bandits are the natural way to automate the selection optimization with machine learning, especially when applying user target—since correct A/B tests are much more complicated in that situation.
- A changing world. Matt explains that by letting the bandit method always leave some chance to select the poorer performing option, you give it a chance to ‘reconsider’ the option effectiveness. It provides a working framework for swapping out low performing options with fresh options, in a continuous process.
In essence, people like bandit algorithms because of the smooth transition between exploration and exploitation, the speed, and the automation.
A few flavors of bandit methodology
There are tons of different bandit methods. Like a lot of debates around testing, a lot of this is of secondary importance—misses the forest for the trees.
Without getting too caught up in the nuances between methods, I’ll explain the simplest (and most common) method: the epsilon-greedy algorithm. Knowing this will allow you to understand the broad strokes of what bandit algorithms are.
One strategy that has been shown to perform well time after time in practical problems is the epsilon-greedy method. We always keep track of the number of pulls of the lever and the amount of rewards we have received from that lever. 10% of the time, we choose a lever at random. The other 90% of the time, we choose the lever that has the highest expectation of rewards. (source)
Okay, so what do I mean by greedy? In computer science, a greedy algorithm is one that always takes the action that seems best at that moment. So, an epsilon-greedy algorithm is almost a fully greedy algorithm—most of the time it picks the option that makes sense at that moment.
However, every once in a while, an epsilon-greedy algorithm chooses to explore the other available options.
So epsilon-greedy is a constant play between:
- Explore: randomly select action certain percent of time (say 20%);
- Exploit (play greedy): pick the current best percent of time (say 80%).
This image (and the article from which it came) explains epsilon-greedy really well:
There are some pros and cons to the epsilon-greedy method. Pros include:
- It’s simple and easy to implement.
- It’s usually effective.
- It’s not as affected by seasonality.
- It doesn’t use a measure of variance.
- Should you decrease exploration over time?
What about other algorithms?
Like I said, a bunch of other bandit methods try to solve these cons in different ways. Here are a few:
Could write 15,000 words on this, but instead, just know the bottom line is that all the other methods are simply trying to best balance exploration (learning) with exploitation (taking action based on current best information).
Matt Gershoff sums it up really well:
Note: if you want to nerd out on the different bandit algorithms, this is a good paper to check out.
When to use bandit tests instead of A/B/n tests?
There’s a high level answer, and then there are some specific circumstances in which bandit works well. For the high level answer, if you have a research question where you want to understand the effect of a treatment and have some certainty around your estimates, a standard A/B test experiment will be best.
According to Matt Gershoff, “If on the other hand, you actually care about optimization, rather than understanding, bandits are often the way to go.”
Specifically, bandit algorithms tend to work well for really short tests—and paradoxically—really long tests (ongoing tests). I’ll split up the use cases into those two groups.
1. Short tests
Bandit algorithms are conducive for short tests for clear reasons—if you were to run a classic A/B test instead, you’d not even be able to enjoy the period of pure exploitation (after the experiment ended). Instead, bandit algorithms allow you to adjust in real time and send more traffic, more quickly, to the better variation. As Chris Stucchio says, “Whenever you have a small amount of time for both exploration and exploitation, use a bandit algorithm.”
Here are specific use cases within short tests:
Headlines are the perfect use case for bandit algorithms. Why would you run a classic A/B test on a headline if, by the time you learn which variation is best, the time where the answer is applicable is over? News has a short half-life, and bandit algorithms determine quickly which is the better headline.
Chris Stucchio used a similar example on his Bayesian Bandits post. Imagine you’re a newspaper editor. It’s not a slow day; a murder victim has been found. Your reporter has to decide between two headlines, “Murder victim found in adult entertainment venue” and “Headless Body in Topless Bar.” As Chris says, geeks now rule the world—this is now usually an algorithmic decision, not an editorial one. (Also, this is probably how sites like Upworthy and BuzzFeed do it).
b. Short term campaigns and promotions
Similar to headlines, there’s a big opportunity cost if you choose to A/B test. If your campaign is a week long, you don’t want to spend the week exploring with 50% of your traffic, because once you learn anything, it’s too late to exploit the best option.
2. Long-term testing
Oddly enough, bandit algorithms are effective in long term (or ongoing) testing. As Stephen Pavlovich put it:
There are a few different use cases within ongoing testing as well:
a. “Set it and forget it” (automation for scale)
Because bandits automatically shift traffic to higher performing (at the time) variations, you have a low-risk solution for continuous optimization. Here’s how Matt Gershoff put it:
Ton Wesseling also mentions that bandits can be great for testing on high traffic pages after learning from A/B tests:
Ton also mentioned that you can learn from contextual bandits:
c. Blending Optimization with Attribution
Finally, bandits can be used to optimize problems across multiple touch points. This communication between bandits ensures that they’re working together to optimize the global problem and maximize results. Matt Gershoff gives the following example:
Caveats: potential drawbacks of bandit testing
Even though there are tons of blog posts with slightly sensationalist titles, there are a few things to consider before jumping on the bandit bandwagon.
MAB is much much more computationally difficult to pull off unless you know what you are doing. The functional cost of doing it is basically the cost of three engineers—a data scientist, one normal guy to put into code and scale the code of what the data scientist says, and one dev-ops person. (Though the last two could probably play double on your team.) It is really rare to find data scientists who program extremely well.
The second thing, though I’m not sure it’s a big issue, is the time it takes to reach significance. As Paras Chopra pointed out, “There’s an inverse relationship (and hence a tradeoff) between how soon you see statistical significance and average conversion rate during the campaign.”
Chris Stucchio also outlined what he called the Saturday/Tuesday problem. Basically, imagine you’re running a test on two headlines:
- Happy Monday! Click here to buy now.
- What a beautiful day! Click here to buy now.
Then suppose you run a bandit algorithm, starting on Monday:
- Monday: 1,000 displays for “Happy Monday,” 200 conversions. 1,000 displays for “Beautiful Day,” 100 conversions.
- Tuesday: 1,900 displays for “Happy Monday,” 100 conversions. 100 displays for “Beautiful Day,” 10 conversions.
- Wednesday: 1,900 displays for “Happy Monday,” 100 conversions. 100 displays for “Beautiful Day,” 10 conversions.
- Thursday: 1,900 displays for “Happy Monday,” 100 conversions. 100 displays for “Beautiful Day,” 10 conversions.
Even though “Happy Monday” is inferior (20% conversion rate on Monday and 5% rest of the week = 7.1% conversion rate), the bandit algorithm has almost converged to “Happy Monday, ” so the samples shown “Beautiful Day” is very low. It takes a lot of data to correct this.
Chris also mentioned that bandits shouldn’t be used for email blasts:
As mentioned above, the situations where bandit testing seems to flourish are:
- Headlines and short-term campaigns;
- Automation for scale;
- Blending optimization with attribution.
Any questions, just ask in the comments!