Bandit algorithms vs fixed-horizon tests

Two ways of running an experiment, optimising for different goals:

Fixed-horizon testing - commit to a sample size, run to it, look once, ship the winner. The classic A/B test design. Optimises for learning the true effect size accurately.
Bandit algorithms - continuously reallocate traffic toward better-performing variants as data accumulates. Optimises for maximising reward over the experiment lifetime, not for measuring effect size.
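
For the fixed-horizon design, the committed sample size usually comes from a power calculation done before launch. The sketch below uses the standard normal-approximation formula for comparing two conversion rates. The function name, the 5% and 6% rates, the 5% significance level and the 80% power are illustrative assumptions, not recommendations.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, variant_rate, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-sided two-proportion test.

    Hypothetical helper: normal approximation, equal allocation to both arms.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    # Sum of the per-visitor variances of the two Bernoulli outcomes.
    variance = (baseline_rate * (1 - baseline_rate)
                + variant_rate * (1 - variant_rate))
    effect = variant_rate - baseline_rate
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Illustrative inputs: detect a lift from 5% to 6% conversion.
print(sample_size_per_arm(0.05, 0.06))
```

With these inputs the answer is on the order of 8,000 visitors per arm. Halving the lift you want to detect roughly quadruples the requirement, which is why the sample size has to be fixed before the test starts and not adjusted as results come in.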

The trade-off is between learning and earning. Fixed-horizon spends roughly equal traffic on all variants (including losing ones) so you get a clean estimate at the end. Bandits move traffic away from underperforming variants quickly, so you earn more during the experiment but learn less about the underperformers.
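
A small worked calculation makes the learning-versus-earning trade-off concrete. The numbers are assumptions chosen for illustration: two variants converting at 5% and 6%, 10,000 visitors, and a 10/90 split standing in for a bandit's skewed allocation.

```python
import math


def expected_conversions(rates, shares, visitors):
    """Expected total conversions when each variant gets a fixed traffic share."""
    return sum(visitors * share * rate for rate, share in zip(rates, shares))


def standard_error(rate, n):
    """Standard error of an observed conversion rate based on n visitors."""
    return math.sqrt(rate * (1 - rate) / n)


rates = [0.05, 0.06]   # assumed true conversion rates
visitors = 10_000

equal_split = expected_conversions(rates, [0.5, 0.5], visitors)
skewed_split = expected_conversions(rates, [0.1, 0.9], visitors)
print(round(equal_split), round(skewed_split))

# Precision of the estimate for the weaker variant under each allocation.
print(standard_error(0.05, 5_000), standard_error(0.05, 1_000))
```

The equal split expects about 550 conversions and the skewed split about 590, so the skewed allocation earns more. In exchange, the weaker variant's rate is estimated from 1,000 visitors where the equal split gives it 5,000, and its standard error rises from roughly 0.0031 to roughly 0.0069.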

How a bandit works

The simplest version is epsilon-greedy: most of the time, show the variant with the best observed performance so far. For a small fraction of traffic (epsilon, often 5-10%), show a variant chosen uniformly at random instead. This exploration keeps the bandit honest about whether the current “best” is really best.
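
The epsilon-greedy rule can be sketched in a few lines. This is a minimal simulation, not a production allocator: the function names, the simulated conversion rates and the choice to treat untried variants as maximally promising are all assumptions made for the example.

```python
import random


def epsilon_greedy_choice(successes, trials, epsilon, rng):
    """Return the index of the variant to show to the next visitor."""
    n_variants = len(trials)
    # Explore: with probability epsilon, pick a variant uniformly at random.
    if rng.random() < epsilon:
        return rng.randrange(n_variants)
    # Exploit: otherwise show the variant with the best observed rate.
    # Assumption: an untried variant counts as rate 1.0 so it gets shown early.
    rates = [s / t if t > 0 else 1.0 for s, t in zip(successes, trials)]
    return max(range(n_variants), key=lambda i: rates[i])


def run_epsilon_greedy(true_rates, visitors, epsilon=0.1, seed=0):
    """Simulate an experiment; true_rates are hypothetical conversion rates."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    trials = [0] * len(true_rates)
    for _ in range(visitors):
        arm = epsilon_greedy_choice(successes, trials, epsilon, rng)
        trials[arm] += 1
        if rng.random() < true_rates[arm]:  # simulated conversion
            successes[arm] += 1
    return successes, trials


successes, trials = run_epsilon_greedy([0.05, 0.06, 0.10], visitors=20_000)
print(trials)  # traffic each variant received
```

Note that the exploration share never shrinks: with epsilon at 10% and three variants, each losing variant keeps receiving about 3% of traffic for as long as the bandit runs.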

Smarter variants - Thompson sampling, upper confidence bound (UCB) - allocate based on the uncertainty of each variant’s estimate, not just the estimate itself. A variant with a wide interval could still plausibly be the best, so it keeps receiving traffic until the data rule it out. A variant with a narrow interval is allocated according to its estimated reward alone.
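
Both ideas fit in a short sketch for conversion-style (yes/no) outcomes. The Thompson sampling version assumes a uniform Beta(1, 1) prior on each variant's conversion rate, and the UCB version uses the common UCB1 bonus term. The function names and simulated rates are hypothetical.

```python
import math
import random


def thompson_choice(successes, failures, rng):
    """Thompson sampling: draw one plausible rate per variant, show the highest."""
    # Beta(1 + successes, 1 + failures) is the posterior under a uniform prior.
    # Variants with little data produce widely spread draws, so they still
    # win some of the time and keep being explored.
    draws = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])


def ucb1_choice(successes, trials):
    """UCB1: show the variant with the highest optimistic estimate."""
    # Show every variant once before computing bounds.
    for i, n in enumerate(trials):
        if n == 0:
            return i
    total = sum(trials)
    # Observed rate plus a bonus that shrinks as a variant accumulates traffic.
    scores = [s / n + math.sqrt(2 * math.log(total) / n)
              for s, n in zip(successes, trials)]
    return max(range(len(scores)), key=lambda i: scores[i])


def run_thompson(true_rates, visitors, seed=0):
    """Simulate Thompson sampling; true_rates are hypothetical conversion rates."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(visitors):
        arm = thompson_choice(successes, failures, rng)
        if rng.random() < true_rates[arm]:  # simulated conversion
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


successes, failures = run_thompson([0.05, 0.06, 0.10], visitors=20_000)
print([s + f for s, f in zip(successes, failures)])  # traffic per variant
```

Unlike epsilon-greedy, neither rule has a fixed exploration share: exploration fades on its own as the intervals narrow.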

When bandits make sense

Limited-time campaigns. Holiday promotions, email subject-line tests for a single send. You can’t reuse the winner later because the moment has passed. Earn-while-you-test is the right framing.
Many variants with unequal expected performance. If you’re testing 20 ad creatives and most are mediocre, bandits will quickly route traffic to the 2-3 strong ones. Fixed-horizon would waste sample on the duds.
Catalogue-style optimisation where you’re not picking one winner but allocating across many items. Recommendation systems are often framed as very large multi-armed bandits, usually with user context added.

When fixed-horizon makes sense

You need to learn the true effect size, not just pick a winner. The fixed-horizon estimate is cleaner because both arms got equal-ish sample.
The decision is binary and durable. “Should we ship this checkout redesign or not?” wants a clean ship/don’t-ship verdict, not an ongoing reallocation.
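Effects might invert over time. Bandits can latch onto an early winner that turns out to be worse long-term (novelty effect), and once traffic has moved away from the other arm there is little data left to reveal the reversal. Fixed-horizon keeps sample on both arms for the whole test, so a reversal can be detected if the test runs past the novelty period.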
Stakeholders want a “the test won” moment. Bandits make ongoing soft decisions that don’t map to the conventional ship/no-ship language.

Things people get wrong

Treating bandits as “smarter A/B tests”. They’re optimising a different objective. Use them when you want reward, not when you want learning.
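Using bandits to test high-stakes durable changes. Adaptive reallocation can lock in a noisy early winner, and the arm that fell behind then gets too little traffic to correct the mistake quickly.
Assuming bandits remove the peeking problem. Bandits are designed to act on interim data, so early looks don’t break the allocation itself. But they don’t make inference safe - naive estimates of the underlying effect size are biased by the adaptive allocation, and fixed-sample p-values and confidence intervals no longer apply as-is.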
Running bandits on small variant sets (2 variants) where the sample-efficiency gain is minimal. The complexity overhead isn’t worth it for simple A/B comparisons.
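
The bias from adaptive allocation mentioned above can be seen in a small simulation. The sketch gives two identical variants the same true conversion rate, allocates traffic with epsilon-greedy, and averages the naive observed rate of one variant over many simulated experiments. All parameter values and names are assumptions chosen to make the effect visible, not a general result about its size.

```python
import random


def mean_naive_estimate(true_rate=0.5, horizon=40, epsilon=0.1,
                        replications=10_000, seed=0):
    """Average observed rate of arm 0 across many epsilon-greedy experiments.

    Both arms share the same true_rate, so an unbiased estimator would
    average out to true_rate.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replications):
        successes = [0, 0]
        trials = [0, 0]
        for step in range(horizon):
            if step < 2:
                arm = step                  # show each arm once first
            elif rng.random() < epsilon:
                arm = rng.randrange(2)      # explore
            else:
                rate0 = successes[0] / trials[0]
                rate1 = successes[1] / trials[1]
                if rate0 == rate1:
                    arm = rng.randrange(2)  # break ties at random
                else:
                    arm = 0 if rate0 > rate1 else 1  # exploit
            trials[arm] += 1
            if rng.random() < true_rate:    # simulated conversion
                successes[arm] += 1
        total += successes[0] / trials[0]   # naive per-arm estimate
    return total / replications


print(mean_naive_estimate())
```

The average tends to come out below the true rate. An arm that starts with an unlucky streak is starved of traffic, so its low estimate is rarely corrected, while a lucky start attracts more traffic and drifts back toward the truth.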