Holdout groups

A holdout is a slice of traffic, typically 5-10% of users, that stays on control after the experiment ships, often permanently. The variant is “shipped” to the other 90-95%, but the holdout never gets it, so you can measure the long-run effect of the change against a clean baseline weeks or months after launch.
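One common way to carve out such a slice is to hash a stable user identifier, so that the same user lands in the same bucket on every visit. The sketch below is a minimal illustration, not a reference implementation: the salt string, the 10% share, and the function name `in_holdout` are all assumptions made for the example.

```python
import hashlib

HOLDOUT_SHARE = 0.10  # assumed value; the text suggests 5-10%


def in_holdout(user_id: str,
               salt: str = "example-holdout-salt",  # hypothetical salt
               share: float = HOLDOUT_SHARE) -> bool:
    """Return True if this user falls in the holdout slice.

    The assignment depends only on (salt, user_id), so it stays the
    same across sessions and deployments as long as the salt is fixed.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 8 hex digits (32 bits) to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < share


# Usage: serve control to holdout users, the shipped variant to everyone else.
experience = "control" if in_holdout("user-42") else "variant"
```

Using a salt that is specific to the holdout keeps its membership independent of the bucketing used by other experiments.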

This addresses a specific blind spot. A/B tests usually run for 2-4 weeks. A test that wins in that window can still underperform over the following 90 days because the novelty effect fades, retention suffers, or the change interacts with seasonality. The holdout is how you actually measure that.

Holdouts are most worth running for:

  • Big changes that might affect retention. Subscription flows, pricing changes, onboarding redesigns. The short-term conversion lift can be misleading if the customers acquired through the variant churn faster.
  • Changes with known novelty risk. Anything visually disruptive or behaviourally novel. Tests of such changes often show an inflated initial lift that fades.
  • Changes that interact with downstream funnels. A “winning” PDP that brings in lower-LTV buyers via aggressive discounting is a long-run loss. The holdout catches it.

The holdout gives you the variant’s effect on LTV, retention, repeat purchase, and any other metric that needs time to develop. The metric that mattered in the short-term test is usually still measured, but the interesting numbers are the ones that couldn’t show up in the test window.

The comparison usually relies on the same statistical machinery as the original A/B test, applied to the variant population versus the holdout population over a longer time frame.
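As a rough sketch of that comparison, the snippet below runs a two-proportion z-test on 90-day retention, using only the standard library. It assumes the original test also used a z-test on a binary metric, which may not match your setup, and the counts are invented purely for illustration.

```python
import math


def two_proportion_ztest(success_a: int, n_a: int, success_b: int, n_b: int):
    """Two-sided z-test for the difference between two proportions.

    Returns (difference, z statistic, p-value), where the difference
    is rate_a - rate_b.
    """
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a - rate_b, z, p_value


# Hypothetical counts: users still active 90 days after launch.
diff, z, p_value = two_proportion_ztest(
    success_a=27_900, n_a=90_000,   # exposed to the shipped variant
    success_b=3_000, n_b=10_000,    # holdout, still on control
)
```

Because the holdout is much smaller than the exposed population, the standard error is dominated by the holdout side, so small long-run differences need a long window or a large user base to detect.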

Holdouts are unprofitable by design. The 5-10% of users who don’t get the change forgo whatever uplift it actually produces. For changes that genuinely lift revenue, the holdout is a small permanent revenue tax. Teams need to be willing to pay it for the measurement value.
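The size of that tax is simple arithmetic: the holdout’s share of users, times baseline revenue per user, times the relative lift they are missing. A back-of-the-envelope sketch follows; the function name and every figure in it are assumptions chosen for illustration.

```python
def holdout_cost(total_users: int,
                 holdout_share: float,
                 baseline_revenue_per_user: float,
                 relative_lift: float) -> float:
    """Revenue forgone over a period by keeping the holdout on control.

    relative_lift is the true lift of the change as a fraction of
    baseline revenue (0.03 means +3%).
    """
    holdout_users = total_users * holdout_share
    return holdout_users * baseline_revenue_per_user * relative_lift


# Hypothetical figures: 1M users, 10% holdout, 40 in annual revenue
# per user on control, and a true lift of 3%.
annual_cost = holdout_cost(
    total_users=1_000_000,
    holdout_share=0.10,
    baseline_revenue_per_user=40.0,
    relative_lift=0.03,
)
```

With these made-up inputs the holdout forgoes about 120,000 a year, against roughly 1.08 million of uplift captured from the other 90% of users. In practice the true lift is exactly what the holdout exists to estimate, so this can only be computed from the short-term estimate up front and revised later.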

That cost is the main reason holdouts are rare outside mature programmes. Most teams take the short-term win and skip the long-run measurement. They learn what the long-run effect was only when they finally turn the change off and discover what the underlying baseline is doing.

Common mistakes:

  • Using holdouts on every change. The opportunity cost compounds. Reserve them for changes where long-run measurement is genuinely needed.
  • Forgetting the holdout exists and re-using its users in subsequent tests. Tracking gets messy when one user is in multiple permanent buckets.
  • Shipping changes to the holdout because “they should have it by now”. Once the holdout is contaminated, the long-run comparison is lost.
  • Treating holdout results as proof the original A/B test was right or wrong. They’re complementary measures of different time horizons, not a retroactive verdict on the test.
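Two of the mistakes above, re-using holdout users in later tests and shipping the change to them, can be guarded against with simple set checks. The sketch below assumes you can list holdout members and the users who were actually served the variant; both helper names are hypothetical.

```python
def eligible_for_new_test(candidate_ids, holdout_ids):
    """Drop holdout members from the candidate pool for a new test."""
    holdout = set(holdout_ids)
    return [uid for uid in candidate_ids if uid not in holdout]


def contaminated_users(holdout_ids, variant_exposed_ids):
    """Return holdout members who were served the variant anyway."""
    return sorted(set(holdout_ids) & set(variant_exposed_ids))


# Hypothetical IDs for illustration.
holdout = ["u2", "u5"]
candidates = ["u1", "u2", "u3", "u4", "u5"]
exposed = ["u1", "u3", "u5"]

pool = eligible_for_new_test(candidates, holdout)   # ["u1", "u3", "u4"]
leaks = contaminated_users(holdout, exposed)        # ["u5"]
```

Running the contamination check on a schedule gives early warning that the long-run comparison is degrading, before the damage is large enough to invalidate it.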