Threats to validity

A test result is only as good as the design that produced it. The list of ways a test can produce a “significant” finding that isn’t real is long, but most fall into a handful of categories. Knowing the categories is what lets you tell a real winner from an artefact.

The things that bias inference even when the test is set up correctly:

  • Peeking. Checking results before the planned sample inflates the false-positive rate.
  • Multiple testing. Many metrics, many segments, many variants: each comparison gets its own chance to cross alpha by chance alone.
  • Underpowered tests. “Not significant” doesn’t mean “no effect” if the sample was too small to detect it.
  • Sample ratio mismatch. A split that’s significantly off the planned ratio is a sign the assignment is broken, and the result is untrustworthy regardless of how nice the numbers look.
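  • Wrong variance model. Using a naive variance on ratio metrics (for example, treating sessions as independent when users were randomised) understates the variance and inflates apparent significance.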
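
The peeking problem in the list above can be made concrete with a small A/A simulation. This is a minimal sketch, not a production tool: both arms draw from the same distribution, so every "significant" result is a false positive. The function name, the number of looks, and the simulation sizes are illustrative assumptions.

```python
import math
import random


def simulate_false_positive_rates(n_tests=400, n_per_arm=1000, n_looks=10,
                                  z_crit=1.96, seed=7):
    """Return (rate if you stop at any significant look, rate at the final look only).

    Assumes n_per_arm is a multiple of n_looks so the last look is the final sample.
    """
    rng = random.Random(seed)
    look_every = n_per_arm // n_looks
    hits_any_look = 0
    hits_final_only = 0
    for _ in range(n_tests):
        sum_a = sum_b = 0.0
        crossed_at_any_look = False
        z = 0.0
        for i in range(1, n_per_arm + 1):
            # Both arms come from N(0, 1): there is no true effect.
            sum_a += rng.gauss(0, 1)
            sum_b += rng.gauss(0, 1)
            if i % look_every == 0:
                # z-statistic for a difference in means with known unit variance.
                z = (sum_a / i - sum_b / i) / math.sqrt(2 / i)
                if abs(z) > z_crit:
                    crossed_at_any_look = True
        hits_any_look += crossed_at_any_look
        hits_final_only += abs(z) > z_crit  # z now holds the final-look value
    return hits_any_look / n_tests, hits_final_only / n_tests


any_look_rate, final_only_rate = simulate_false_positive_rates()
print(f"stop at first significant look: {any_look_rate:.3f}")
print(f"analyse once at planned sample: {final_only_rate:.3f}")
```

Because the final look is one of the looks, the first rate can never be lower than the second. In runs like this it should come out noticeably above the nominal 5%, which is the inflation the peeking bullet describes.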

The things that bias the data before it even enters the analysis:

  • Non-random assignment. IP-based, time-of-day-based, or other non-random bucketing maps confounders directly onto variant.
  • Contamination across variants. Users seeing both variants, sessions stitched incorrectly across devices, shared browsers leaking exposure between users.
  • Wrong randomisation unit. Randomising users but measuring sessions, or randomising sessions while users hop between them.
  • Test-on-test interactions. Multiple tests running on overlapping surfaces, polluting each other’s metrics.
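
One common way to avoid IP-based or time-based bucketing is to derive the variant from a salted hash of a stable user identifier. The sketch below assumes such an identifier exists. The names `assign_variant`, `user_id` and `experiment_key` are hypothetical and not taken from any particular platform.

```python
import hashlib


def assign_variant(user_id: str, experiment_key: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant using a salted hash.

    Keying on the user (not the session, IP, or timestamp) keeps the
    randomisation unit at the user level. Salting with the experiment key
    means two concurrent experiments bucket the same user independently.
    """
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).hexdigest()
    # Use the leading 60 bits of the hash as a near-uniform integer.
    bucket = int(digest[:15], 16) % len(variants)
    return variants[bucket]


print(assign_variant("user-123", "checkout-redesign"))
print(assign_variant("user-123", "checkout-redesign"))  # same user, same variant
```

The assignment is only as stable as the identifier: if the same person appears under different IDs on different devices, contamination across variants is still possible. Independent bucketing also does not stop two tests on the same surface from interacting. It only prevents their assignments from being correlated.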

The ones that depend on when and how long the test runs:

  • Novelty effect. Initial lifts that fade as users habituate to the change.
  • Regression to the mean. Early estimates from noisy data tend to be extreme and shrink as more data arrives, so an early peek overstates the effect.
  • Seasonal confounding. A test that overlaps Black Friday or a brand campaign mixes the variant effect with the external event.
  • Day-of-week effects. Tests run for fewer than 7 days don’t cover every day of the week, so weekday and weekend behaviour is unevenly represented.
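
A simple guard against day-of-week effects is to round the planned duration up to whole weeks. The helper below is a rough sketch under simplifying assumptions: constant daily traffic, and every eligible user counted once. The function and parameter names are illustrative.

```python
import math


def planned_duration_days(required_users_per_arm: int,
                          daily_eligible_users: int,
                          n_arms: int = 2) -> int:
    """Days needed to reach the planned sample, rounded up to full weeks."""
    days_for_sample = math.ceil(required_users_per_arm * n_arms / daily_eligible_users)
    # Never plan for less than one full week, and always end on a week boundary
    # so each day of the week appears equally often.
    full_weeks = max(1, math.ceil(days_for_sample / 7))
    return full_weeks * 7


print(planned_duration_days(20_000, 5_000))  # 8 days of traffic needed -> 14
print(planned_duration_days(1_000, 5_000))   # under a day of traffic -> 7
```

Whole weeks do not protect against novelty effects or seasonal events. Those need a longer window or a deliberate choice of calendar period.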

The ones that come from the metric itself:

  • Wrong primary metric. A metric that doesn’t link to the outcome you actually care about can show wins that don’t translate.
  • Missing data and outliers. Sessions dropped silently, extreme orders dominating ratio metrics (see handling missing data and outliers).
  • Proxy metric drift. The proxy used to be a good stand-in for revenue or LTV but isn’t anymore. Tests on the proxy mislead.
  • Tracking failures. Conversion events firing inconsistently across variants because of variant-specific JS issues or third-party script interactions.

The ones at the analysis step:

  • HARKing (hypothesising after results are known). Generating the hypothesis after looking at the result and presenting it as a prediction. See hypothesis formulation.
  • Segment fishing. Slicing the data until something “wins”, then writing up that segment as the result.
  • Metric shopping. Trying many metrics post-hoc and reporting whichever crossed threshold.
  • HiPPO override. The result is correctly interpreted and then politically reversed.
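
When several segments or metrics really do need to be examined, a multiplicity correction makes the cost of each extra comparison explicit. The sketch below uses the Holm step-down procedure as one possible choice. The text does not prescribe a specific correction, and the segment names and p-values are made up for illustration.

```python
from typing import Dict


def holm_adjust(p_values: Dict[str, float]) -> Dict[str, float]:
    """Return Holm-adjusted p-values for a family of comparisons."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    adjusted = {}
    running_max = 0.0
    for rank, (name, p) in enumerate(ordered):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p)
        # Enforce monotonicity: an adjusted p-value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[name] = running_max
    return adjusted


# Hypothetical segment-level p-values from a single test.
raw = {"all users": 0.20, "mobile": 0.04, "desktop": 0.60, "returning": 0.012}
for segment, p_adj in holm_adjust(raw).items():
    print(f"{segment}: raw={raw[segment]:.3f} adjusted={p_adj:.3f}")
```

In this made-up example the mobile segment looks like a win at a raw p of 0.04, but its adjusted value is about 0.12 once the four comparisons are accounted for. The correction is only honest if every comparison that was actually looked at is included in the family.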

Pre-registration handles most of the analysis-step threats. SRM checks handle most of the design threats. Power calculations handle the statistical threats. Holdouts and longer test windows handle most of the time threats. Tracking audits handle the measurement threats.
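
Two of the checks named above are short enough to sketch with the standard library. The first is an SRM check as a chi-square goodness-of-fit test for a two-arm split. The second is a power calculation using the usual normal-approximation formula for comparing two proportions. Function names, default alpha and power, and the example numbers are illustrative assumptions.

```python
import math
from statistics import NormalDist


def srm_p_value(n_control: int, n_treatment: int,
                planned_control_share: float = 0.5) -> float:
    """p-value that the observed two-arm split matches the planned ratio."""
    total = n_control + n_treatment
    expected_control = total * planned_control_share
    expected_treatment = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, the chi-square tail probability is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def users_per_arm(baseline_rate: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users per arm for a two-sided test on two proportions."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


print(srm_p_value(50_000, 51_500))  # planned 50/50 split
print(users_per_arm(0.05, 0.10))    # 5% baseline, 10% relative lift
```

For the example split the p-value is far below 0.001, a threshold often used as an SRM alarm (treat the exact cut-off as an assumption). The power calculation comes out at roughly 31,000 users per arm, which shows how quickly small relative lifts on low baseline rates demand large samples.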

None of these are exotic. They’re just rarely all done together. A programme that consistently runs through the threat catalogue for every test is unusual, and it is also the programme whose results actually compound.