Missing data and outliers

Most test analyses ignore both. Sessions with missing fields get silently dropped. Extreme values (whales spending £5,000, sessions of three hours) get included as if they were typical. Either choice can flip a result without anyone noticing, and the defaults in most testing platforms are usually wrong.

Three flavours of why data goes missing, with different implications:

  • Missing completely at random. The missingness is unrelated to anything (random tracking failure, browser issues). Listwise deletion is fine - dropped users represent the underlying population.
  • Missing at random. The missingness depends on observed variables but not on the outcome. For example, iOS users opt out of tracking at a higher rate, but within each platform the missingness is random. Listwise deletion works if you stratify by the observed variable (here, platform); otherwise it biases the result.
  • Missing not at random. The missingness depends on the outcome itself. Users who don’t convert close the tab before the conversion event fires. Listwise deletion biases toward over-counting converters. This is the dangerous case and it’s common.
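To make the third case concrete, here is a small deterministic sketch. The numbers (1,000 users, a 10% true conversion rate, 30% of sessions lost) are invented for illustration, and `observed_rate` is a hypothetical helper, not something a testing platform provides.

```python
def observed_rate(converters, non_converters, loss_conv=0.0, loss_nonconv=0.0):
    """Conversion rate after listwise deletion of unrecorded sessions.

    loss_conv / loss_nonconv are the assumed fractions of converter and
    non-converter sessions that never make it into the data.
    """
    kept_conv = converters * (1 - loss_conv)
    kept_nonconv = non_converters * (1 - loss_nonconv)
    return kept_conv / (kept_conv + kept_nonconv)

# Ground truth: 100 converters out of 1,000 users.
true_rate = observed_rate(100, 900)

# Missing completely at random: both groups lose 30% of sessions.
mcar_rate = observed_rate(100, 900, loss_conv=0.3, loss_nonconv=0.3)

# Missing not at random: only non-converters lose 30% of sessions.
mnar_rate = observed_rate(100, 900, loss_nonconv=0.3)

print(f"true {true_rate:.1%}, MCAR {mcar_rate:.1%}, MNAR {mnar_rate:.1%}")
```

Under these assumptions the completely-at-random loss leaves the rate unchanged, while losing only non-converters inflates it from 10% to roughly 13.7%. If that loss differs between variants, the inflation differs too, and the comparison is biased.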

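The pragmatic CRO approach: check the sample ratio by variant, that is, compare the number of sessions recorded in each variant against the split you intended. If the ratio is clean, missingness is probably balanced across variants. If it's not, missing-data bias is plausibly affecting the result and you should investigate before trusting it.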
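One way to run that check is a chi-square goodness-of-fit test on the recorded session counts. The sketch below assumes a two-variant test with an intended 50/50 split; `srm_p_value` is a hypothetical name, and the 0.001 alert threshold is a commonly used convention rather than anything this guide prescribes.

```python
import math

def srm_p_value(n_control, n_variant, expected_share_control=0.5):
    """Chi-square goodness-of-fit p-value for a two-arm split (1 degree of freedom)."""
    total = n_control + n_variant
    exp_control = total * expected_share_control
    exp_variant = total - exp_control
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_variant - exp_variant) ** 2 / exp_variant)
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)), so no stats library is needed.
    return math.erfc(math.sqrt(chi2 / 2))

ALERT_THRESHOLD = 0.001  # assumed convention; pick yours before the test starts

p_balanced = srm_p_value(5_020, 4_980)  # small, plausible imbalance
p_skewed = srm_p_value(5_300, 4_700)    # imbalance too large to be chance

for label, p in [("balanced", p_balanced), ("skewed", p_skewed)]:
    verdict = "investigate" if p < ALERT_THRESHOLD else "looks clean"
    print(f"{label}: p = {p:.3g} -> {verdict}")
```

A low p-value does not say why sessions are missing, only that the recorded split is unlikely under the intended allocation.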

Extreme values can dominate ratio metrics like AOV and revenue per session. One £10,000 order in a variant with 10,000 sessions moves that variant's mean by £1 per session, which can flip a “win” into a “loss” or vice versa.
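The arithmetic is worth spelling out. In the sketch below the revenue totals are invented, the variant is assumed to have 10,000 sessions of its own, and `mean_shift` is a hypothetical helper.

```python
def mean_shift(order_value, sessions_in_variant):
    """How far one extra order moves a variant's revenue-per-session mean."""
    return order_value / sessions_in_variant

# Invented totals: 10,000 sessions per variant, variant ahead by £0.10 per session.
SESSIONS = 10_000
control_revenue = 25_000.0
variant_revenue = 26_000.0

diff_without_whale = variant_revenue / SESSIONS - control_revenue / SESSIONS

# Same test, but a single £10,000 order happens to land in control.
diff_with_whale = variant_revenue / SESSIONS - (control_revenue + 10_000) / SESSIONS

print(f"shift from one order: £{mean_shift(10_000, SESSIONS):.2f} per session")
print(f"variant minus control without the order: £{diff_without_whale:+.2f}")
print(f"variant minus control with the order:    £{diff_with_whale:+.2f}")
```

If the 10,000 sessions were the whole test split evenly, the affected arm would have only 5,000 sessions and the shift would double to £2 per session.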

Three handling options:

  • Include as-is. Honest but high-variance. The estimate is correct in expectation but the confidence interval is wide.
  • Winsorise. Cap extreme values at a percentile (e.g. 99th). Reduces variance, keeps the user in the analysis. Standard in mature experimentation platforms.
  • Trim. Drop extreme values entirely. Maximum variance reduction but discards information.

Whichever you pick, do it for both variants identically and decide before seeing the data. Otherwise you’re choosing the rule that confirms the result you wanted.
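A minimal sketch of winsorising and trimming, applied identically to both variants. It assumes the cap is the 99th percentile of the pooled order values and that the percentile was fixed before the test; the data and helper names are invented, and pooling is only one reasonable way to get a shared cap.

```python
import math

WINSOR_PCT = 99  # decided before looking at the data (assumption for this sketch)

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def winsorise(values, cap):
    """Replace anything above the cap with the cap; every session stays in."""
    return [min(v, cap) for v in values]

def trim(values, cap):
    """Drop anything above the cap entirely."""
    return [v for v in values if v <= cap]

def mean(values):
    return sum(values) / len(values)

# Invented order values: control contains one £5,000 whale.
control = [20, 40] * 49 + [30, 5000]
variant = [22, 44] * 50

cap = percentile(control + variant, WINSOR_PCT)  # one cap, shared by both arms

for name, handler in [("as-is", lambda v, c: v), ("winsorised", winsorise), ("trimmed", trim)]:
    c_mean = mean(handler(control, cap))
    v_mean = mean(handler(variant, cap))
    print(f"{name:>10}: control £{c_mean:.2f}, variant £{v_mean:.2f}")
```

With these made-up numbers the raw means favour control purely because of the whale, while the winsorised and trimmed means both favour the variant. The point is not which answer is right but that the rule, not the data, should decide.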

Both missing data and outliers do more damage to heavy-tailed value metrics than to bounded rates. A heavy-tailed AOV is more sensitive to extreme orders than a click-through rate is to extreme click patterns, because a single order can be hundreds of times the typical value while a single session can move a rate by at most one click. Apply variance-stabilising treatments (winsorising, log-transforming) more aggressively for metrics where the tail matters.
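As a rough illustration of the log transform, the sketch below compares how much one extreme order moves the raw mean and the mean of `log1p` values. The order values are invented. Note that the log-scale mean estimates a different quantity from the arithmetic mean, so it is a robustness check rather than a drop-in replacement for revenue per session.

```python
import math

def mean(values):
    return sum(values) / len(values)

def relative_change(before, after):
    return (after - before) / before

# Invented data: 100 typical £20 orders, then the same set with one swapped for £5,000.
typical = [20.0] * 100
with_whale = [20.0] * 99 + [5000.0]

raw_change = relative_change(mean(typical), mean(with_whale))

log_typical = [math.log1p(v) for v in typical]
log_with_whale = [math.log1p(v) for v in with_whale]
log_change = relative_change(mean(log_typical), mean(log_with_whale))

print(f"raw mean moves by {raw_change:.0%}")
print(f"log1p mean moves by {log_change:.1%}")
```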

Common mistakes:

  • Using whatever the platform defaults to without checking what it does. Some platforms winsorise, some don’t. Some apply listwise deletion, some don’t. The default is usually undocumented.
  • Deciding outlier rules after seeing the data. That’s p-hacking by a different name.
  • Assuming missing data is missing completely at random when it’s not. Tracking failures correlate with bots, mobile devices, ad-blockers, and other systematic factors.
  • Ignoring sample-ratio mismatch (SRM) as a missing-data signal. SRM is often missing-data bias surfacing through an unexpected indicator.