Control vs treatment

Every A/B test has at least two groups:

  • Control - the existing version. The baseline against which the variant is measured.
  • Treatment - the variant being tested. Embodies the hypothesis.

The comparison between them is the experiment. Random assignment makes the two groups equivalent in expectation, so any outcome difference is plausibly the effect of the change rather than the effect of which users happened to land where.
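
One common way to get stable random assignment is to hash a user identifier together with an experiment name. The sketch below is illustrative only: the function name `assign_group`, the experiment name, and the 50/50 default split are assumptions, not something the text prescribes.

```python
import hashlib


def assign_group(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hypothetical helper: hashing the experiment name with the user id keeps
    a user in the same group on every visit, and gives different experiments
    independent splits.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # First 8 hex characters -> integer in [0, 2**32), scaled to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"


if __name__ == "__main__":
    counts = {"control": 0, "treatment": 0}
    for i in range(10_000):
        counts[assign_group(f"user-{i}", "checkout-redesign")] += 1
    print(counts)  # roughly even split; exact counts depend on the hash
```

Because the assignment depends only on the hash, not on anything about the user's behaviour, the two groups should look alike on average, which is what makes the outcome difference interpretable.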

In most simple A/B tests the control is obvious - it’s whatever’s currently live. In bigger redesigns the question gets harder. Is the control the current site as-is, or the current site with minor cleanup that was going to ship anyway? Is the control the version users see now, or a tighter version that strips out known broken bits?

The principle is that control should represent the realistic counterfactual - what would have been live if the test hadn’t happened. Choosing an artificially weak control inflates the apparent lift. Choosing a “fairer” control that doesn’t actually represent the status quo distorts the strategic decision, because the measured lift no longer describes what launching would actually change.
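
To see how much the choice of control can move the headline number, here is a small worked example. All three conversion rates are made-up illustrative values, and `relative_lift` is a hypothetical helper.

```python
def relative_lift(control_rate: float, treatment_rate: float) -> float:
    """Relative lift of treatment over control, e.g. 0.04 means +4%."""
    return (treatment_rate - control_rate) / control_rate


# Hypothetical conversion rates, for illustration only.
realistic_control = 0.050  # what would have been live without the test
weakened_control = 0.045   # an artificially weak baseline
treatment = 0.052

print(f"{relative_lift(realistic_control, treatment):.1%}")  # 4.0%
print(f"{relative_lift(weakened_control, treatment):.1%}")   # 15.6%
```

The treatment is identical in both lines; only the baseline changed, and the reported lift nearly quadrupled.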

A/B/C/D tests use one control and multiple treatments. The maths is straightforward, but the multiple testing problem kicks in: more variants compared against control means more chances for one to cross the significance threshold by chance. Holding the overall false-positive rate constant means a stricter threshold for each comparison, which in turn means more sample per variant to keep the same power.
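
A rough sketch of the arithmetic, under a simplifying assumption: each treatment-vs-control comparison is treated as independent, which is not strictly true when they share a control. The Bonferroni correction shown is one standard adjustment, not necessarily the one a given testing tool applies; both function names are hypothetical.

```python
def familywise_error_rate(alpha: float, comparisons: int) -> float:
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


def bonferroni_alpha(alpha: float, comparisons: int) -> float:
    """Per-comparison threshold that keeps the overall rate at or below alpha."""
    return alpha / comparisons


if __name__ == "__main__":
    for k in (1, 2, 3, 5):
        fwer = familywise_error_rate(0.05, k)
        per_test = bonferroni_alpha(0.05, k)
        print(f"{k} treatment(s): uncorrected FWER={fwer:.3f}, "
              f"Bonferroni per-test alpha={per_test:.4f}")
```

With three treatments each tested at 0.05, the chance that at least one looks like a winner by luck is roughly 14% under this approximation. The stricter per-test threshold is what drives up the required sample.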

A holdout is a special kind of control. The variant ships to 90% of traffic; 10% stays on control permanently for long-run measurement. The holdout’s purpose is different. It measures the long-tail effect of the change after launch, not whether the change wins in a fixed-horizon test. See holdout groups.
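
A minimal sketch of how a 10% holdout might be carved out, assuming users are identified by a stable string id. The function `in_holdout`, the `holdout:` salt prefix, and the launch name are hypothetical; the point is only that membership is fixed per user and does not change from session to session.

```python
import hashlib

HOLDOUT_SHARE = 0.10  # the 10% from the 90/10 split described above


def in_holdout(user_id: str, launch: str, share: float = HOLDOUT_SHARE) -> bool:
    """Return True if this user stays on control after the launch.

    Hypothetical helper: the 'holdout:' prefix keeps this split separate
    from any hashing scheme used for the original experiment.
    """
    digest = hashlib.sha256(f"holdout:{launch}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 1000  # 0..999
    return bucket < round(share * 1000)


if __name__ == "__main__":
    held = sum(in_holdout(f"user-{i}", "checkout-redesign") for i in range(20_000))
    print(f"{held / 20_000:.1%} of users held out")  # close to 10%
```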

Common mistakes:

  • Treating the launched winner as the new “control” for the next test without considering whether the change was actually durable. The winner from a noisy, underpowered test isn’t a stable baseline.
  • Not documenting what control was at the time the test ran. Three months later when you want to reanalyse, “current site” doesn’t tell you what was actually live.
  • Assuming control performance is stable. Seasonality, ad-mix changes, and other tests running on related surfaces all move control. Comparisons against historical control data are usually misleading; the comparison that counts is against a control running at the same time as the treatment.
  • Using a control that no real user ever sees. Some teams test against a stripped-down “clean baseline” instead of the live site. The result doesn’t generalise to launch.
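
On the documentation point above, one lightweight option is to store a small record of what control was alongside the test results. The sketch below is an assumption about what such a record could hold; the `ControlSnapshot` class, its field names, and the example values are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class ControlSnapshot:
    """What 'control' meant while the test ran (illustrative fields only)."""
    experiment: str
    control_version: str  # e.g. a release tag or commit hash
    start_date: date
    end_date: date
    concurrent_tests: tuple[str, ...] = ()  # other tests on related surfaces
    notes: str = ""


def to_json(snapshot: ControlSnapshot) -> str:
    """Serialise the snapshot so it can be stored next to the results."""
    record = asdict(snapshot)
    record["start_date"] = snapshot.start_date.isoformat()
    record["end_date"] = snapshot.end_date.isoformat()
    return json.dumps(record, indent=2)


if __name__ == "__main__":
    # All values below are placeholders.
    snapshot = ControlSnapshot(
        experiment="checkout-redesign",
        control_version="release-hypothetical",
        start_date=date(2024, 1, 8),
        end_date=date(2024, 1, 22),
        concurrent_tests=("pricing-banner",),
        notes="Control included the footer cleanup that shipped the week before.",
    )
    print(to_json(snapshot))
```

A record like this answers the "what was actually live" question at reanalysis time without relying on anyone's memory of the site.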