A/B testing

An A/B test is a randomised controlled experiment comparing two versions of something - usually a webpage, an email, a UI element, or an ad. Visitors are randomly assigned to either the control or the variant, and the difference between the two groups on a chosen metric is measured. If the variant outperforms the control by a statistically significant margin, you ship it.

A/B testing is the foundational unit of an experimentation programme. The scaffolding around it - primary metric choice, sample size planning, peeking discipline, holdout measurement - exists because A/B tests need that scaffolding to work. A single test is a unit of work; the compounding value comes from running tests repeatedly within a structured CRO process.

The procedure

The full chain in theory:

Form a hypothesis about a change that will affect a metric.
Pick a primary metric and guardrails. Decide what counts as a win and what counts as too high a cost.
Set alpha, power, and the minimum detectable effect (MDE) you care about.
Calculate the sample size that combination demands. Confirm it’s feasible given your traffic.
Randomly assign incoming traffic to control or variant.
Run until the planned sample is reached. Don’t peek.
Analyse the result. If significant, above MDE, and no guardrail breached, ship.
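
As a rough illustration of the planning and analysis steps above, the sketch below uses the standard normal-approximation formulas for comparing two conversion rates. The function names, the 5% baseline and the 1-point absolute MDE are assumptions chosen for the example, not values from any particular tool.

```python
from math import ceil, sqrt
from statistics import NormalDist

_norm = NormalDist()  # standard normal distribution

def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Visitors needed in each arm to detect an absolute lift of mde_abs
    with a two-sided test (normal approximation)."""
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    z_beta = _norm.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

def two_proportion_z_test(conv_c, n_c, conv_v, n_v):
    """Return (absolute lift, two-sided p-value) using a pooled standard error."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    pooled = (conv_c + conv_v) / (n_c + n_v)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_v))
    z = (p_v - p_c) / se
    return p_v - p_c, 2 * (1 - _norm.cdf(abs(z)))

def should_ship(lift, p_value, mde_abs, alpha=0.05, guardrail_breached=False):
    """Ship only if significant, at or above the MDE, and no guardrail is breached."""
    return p_value < alpha and lift >= mde_abs and not guardrail_breached

if __name__ == "__main__":
    # Hypothetical plan: 5% baseline conversion, 1 percentage point MDE.
    print("visitors per arm:", sample_size_per_arm(0.05, 0.01))
    # Hypothetical result: 400/8000 control vs 500/8000 variant conversions.
    lift, p = two_proportion_z_test(400, 8000, 500, 8000)
    print("lift:", lift, "p-value:", p, "ship:", should_ship(lift, p, 0.01))
```

With these assumed inputs the plan calls for roughly 8,000 visitors per arm. Halving the MDE roughly quadruples that number, which is why the feasibility check belongs before the test starts rather than after.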

In practice, every step gets compressed, skipped, or fudged. Hypotheses are vague, sample sizes aren’t calculated, peeking starts on day 1, and “winners” get shipped on noisy underpowered data. Most “A/B testing” in the wild is more like “look at two versions and pick the one with the higher number”. The discipline is in following the procedure even when it’s inconvenient.
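
To see why peeking matters, here is a small simulation sketch of A/A tests, where both arms share the same true conversion rate and any "winner" is a false positive. It compares checking significance at several interim looks against checking once at the planned sample. The simulation sizes, the 10% conversion rate and the number of looks are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

def is_significant(conv_a, conv_b, n, alpha=0.05):
    """Two-sided pooled z-test for two arms that each have n visitors."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False  # no variation yet, nothing to test
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha

def false_positive_rates(n_sims=400, n_per_arm=1000, looks=10, rate=0.10, seed=1):
    """Simulate A/A tests; return (rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    checkpoints = {n_per_arm * k // looks for k in range(1, looks + 1)}
    peeking_hits = final_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        any_look = final_look = False
        for i in range(1, n_per_arm + 1):
            conv_a += rng.random() < rate  # True counts as 1
            conv_b += rng.random() < rate
            if i in checkpoints:
                sig = is_significant(conv_a, conv_b, i)
                any_look = any_look or sig  # a peeker would stop at the first hit
                if i == n_per_arm:
                    final_look = sig
        peeking_hits += any_look
        final_hits += final_look
    return peeking_hits / n_sims, final_hits / n_sims

if __name__ == "__main__":
    peeking, final_only = false_positive_rates()
    print("false positives with peeking:  ", peeking)
    print("false positives, one final look:", final_only)
```

The single final look should land near the nominal alpha, while stopping at the first significant interim look typically produces a noticeably higher false-positive rate.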

When A/B testing is the right tool

A/B tests work best when:

The change is small and bounded. You can isolate its effect from everything else on the page.
The conversion event happens within a reasonable test window. Purchase, sign-up, click - not 90-day retention.
You have enough traffic to power the test on a realistic MDE.
The change can be cleanly toggled (feature flag, theme variant, copy swap), so the variant and control truly differ on one thing.

They work badly when:

The decision is strategic rather than tactical. Pricing model overhauls, brand redesigns, market expansions. A/B testing brand identity is theatre.
The metric only develops over months (LTV, retention curves). Use holdouts instead.
Traffic is too low to power any reasonable MDE. Below ~30k sessions a month, most stores are running directional tests, not statistical ones.
The change is too interconnected to isolate. Site-wide redesigns affect everything; isolating “the redesign’s effect” via a single A/B is usually impossible.
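
The traffic constraint above can be made concrete with a back-of-envelope sketch that inverts the sample size formula: given the sessions available, what is the smallest lift the test could plausibly detect? The 2% baseline conversion rate and the assumption that a full month of traffic goes into one 50/50 test are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

def min_detectable_relative_lift(total_sessions, baseline, alpha=0.05, power=0.80):
    """Approximate smallest relative lift a 50/50 two-sided test can detect.

    Uses the normal approximation with the baseline variance in both arms,
    so treat the result as a planning estimate, not an exact figure.
    """
    n_per_arm = total_sessions / 2
    norm = NormalDist()
    z = norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)
    mde_abs = z * sqrt(2 * baseline * (1 - baseline) / n_per_arm)
    return mde_abs / baseline

if __name__ == "__main__":
    # Hypothetical store: 30k sessions in the test window, 2% baseline conversion.
    lift = min_detectable_relative_lift(30_000, 0.02)
    print(f"smallest detectable relative lift: {lift:.0%}")
```

Under these assumptions the detectable relative lift comes out at a little over 20%, far larger than most single changes deliver, which is what pushes low-traffic tests into directional territory.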

For decisions that don’t fit, the honest alternatives are qualitative research, holdouts, or pre/post analysis with explicit acknowledgement of its limits.

What an A/B test is not

An opinion test. “Run it past the team to see who likes it better.” Not an A/B test.
A sequential rollout. Shipping a change and comparing this week’s metric to last week’s is not an A/B test - the comparison isn’t controlled.
Multivariate testing (MVT). Distinct technique that tests multiple variables simultaneously. Useful in different situations, more sample-hungry, easier to misinterpret.
A bandit algorithm. Bandits dynamically reallocate traffic to better-performing variants in real time. Better when the goal is to maximise reward, worse when the goal is to learn the true effect size.
Pre-post analysis. Comparing the metric before and after a change is not an A/B test because there’s no concurrent control. Any time-varying confounder (seasonality, ad spend, etc.) shows up as an apparent effect.
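
To illustrate how a bandit differs from a fixed 50/50 split, here is a minimal sketch of Thompson sampling, one common bandit strategy, with Beta-Bernoulli arms. The true conversion rates, the round count and the function name are assumptions for the simulation; a production bandit would need more care.

```python
import random

def thompson_allocation(true_rates, rounds=5000, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling.

    Returns how many times each arm was served. true_rates are the
    (normally unknown) conversion rates, used here only to simulate outcomes.
    """
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        # Draw one plausible conversion rate per arm from its Beta posterior.
        draws = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
        arm = draws.index(max(draws))  # serve the arm with the best draw
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return pulls

if __name__ == "__main__":
    # Hypothetical arms: 5% vs 10% true conversion.
    print(thompson_allocation([0.05, 0.10]))
```

Traffic drifts towards the better arm as evidence accumulates. That is good for reward, but the losing arm ends up with few observations, so its effect size is estimated poorly.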

Why the structure matters

Random assignment is the engine. Without it, the comparison isn’t valid. Whatever made some users land in variant vs control could be correlated with the outcome - traffic source, time of day, device, ad campaign. Randomisation eliminates these confounds in expectation.
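
One common way to implement random assignment is to hash a stable user identifier together with the experiment name, so that a user always sees the same arm and different experiments are bucketed independently. The sketch below assumes string user IDs and a hypothetical experiment name; it is not any specific platform's implementation.

```python
import hashlib

def assign(user_id: str, experiment: str, variant_share: float = 0.5) -> str:
    """Deterministically assign a user to 'control' or 'variant'.

    The hash input includes the experiment name so that bucketing in one
    experiment is not correlated with bucketing in another.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # approximately uniform in [0, 1)
    return "variant" if bucket < variant_share else "control"

if __name__ == "__main__":
    # Hypothetical experiment name and user IDs.
    arms = [assign(f"user-{i}", "checkout-button-copy") for i in range(10_000)]
    print("variant share:", arms.count("variant") / len(arms))
```

Because the assignment depends only on the hash, it cannot correlate with traffic source, device or time of day, which is the property the comparison relies on.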

The other half is concurrency. Control and variant run at the same time, on the same traffic. Comparing variant to “last year’s baseline” loses the concurrency and re-introduces the time-varying confounds that a concurrent control is there to remove.

What makes A/B tests fail in practice

See threats to validity for the full catalogue. The common ones:

Underpowered. The test couldn’t have detected the effect size that’s actually there. The “not significant” result is uninformative.
Peeked. Stopped when it looked good, before the planned sample.
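Sample ratio mismatch. The observed split between control and variant differs from the planned allocation by more than chance would explain - a sign that randomisation or tracking is broken and the two groups are no longer comparable.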
Wrong metric. Optimised a proxy that doesn’t link to revenue or LTV.
Novelty effect. Stopped while users were still reacting to the change being new, before the initial lift had time to fade.
External confounds. Test overlapped a sale, a launch, a press cycle.
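
Of the failures above, sample ratio mismatch is the easiest to check mechanically. A minimal sketch, assuming a planned 50/50 split, is a chi-square goodness-of-fit test on the visitor counts per arm. The counts and the strict 0.001 threshold below are illustrative assumptions.

```python
from math import erfc, sqrt

def srm_p_value(n_control, n_variant, expected_variant_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the split.

    A very small p-value means the observed split is unlikely under the
    planned allocation, which points to broken randomisation or tracking.
    """
    total = n_control + n_variant
    expected_variant = total * expected_variant_share
    expected_control = total - expected_variant
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_variant - expected_variant) ** 2 / expected_variant)
    # For 1 degree of freedom, the chi-square tail equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))

if __name__ == "__main__":
    # Hypothetical counts: 10,000 control vs 10,500 variant on a planned 50/50 split.
    p = srm_p_value(10_000, 10_500)
    print("SRM p-value:", p, "-> investigate" if p < 0.001 else "-> looks fine")
```

A flagged mismatch is a reason to distrust the whole result and investigate the assignment and logging pipeline, not something to correct for after the fact.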