Multiple testing

Every test you run at alpha = 0.05 has a 5% chance of coming back “significant” by pure chance when there’s no real effect. Run 20 such tests where nothing is actually going on and you’d expect one false positive on average, with roughly a 64% chance of at least one if the tests are independent. Run 100 segment splits looking for a winner and you’ll almost certainly find one whether or not there’s anything there.
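To put numbers on this, here is a minimal sketch. It assumes every null hypothesis is true and, for the "at least one" figure, that the tests are independent. The function names are illustrative, not taken from any library.

```python
def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Average number of false positives when every null is true."""
    return alpha * n_tests


def chance_of_any_false_positive(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive (assumes independent tests)."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 20, 100):
    print(
        n_tests,
        round(expected_false_positives(0.05, n_tests), 2),
        round(chance_of_any_false_positive(0.05, n_tests), 3),
    )
```

At 20 tests the chance of at least one false positive is about 64%; at 100 tests it is above 99%.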

This is the multiple testing problem. It shows up in three flavours in CRO:

  • Multiple variants in a single test (MVT, or A/B/C/D). More variants compared against control = more chances for one to randomly cross the threshold.
  • Multiple metrics on the same test (conversion, AOV, bounce, time on page, scroll depth). More outcomes you check = more chances of finding “significance” somewhere.
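  • Multiple segments analysed post-hoc (mobile, desktop, new users, returning, top-of-funnel sources, etc). This is the worst offender. With 10 segments at alpha 0.05 you’d expect half a false positive per test on average, which works out to roughly a 40% chance of at least one spurious “significant” segment (treating the segments as independent), every test.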
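The segment case can be checked with a quick simulation. This is a rough sketch under simplifying assumptions: there is no real effect anywhere, the segments are independent, and each segment's p-value is drawn as a uniform random number (which is how p-values from a well-calibrated continuous test behave under the null). The function name and defaults are hypothetical.

```python
import random


def simulate_segment_cuts(n_segments=10, alpha=0.05, n_experiments=10_000, seed=1):
    """Share of no-effect experiments with at least one 'significant' segment."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # One null p-value per segment, uniform on [0, 1].
        p_values = [rng.random() for _ in range(n_segments)]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_experiments


rate = simulate_segment_cuts()
print(f"Experiments with a spurious segment 'win': {rate:.1%}")
```

The printed share should land close to 1 - 0.95^10, or about 40%. Real segments overlap (a mobile user can also be a new user), so the true figure for a given cut will differ, but the direction of the problem is the same.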

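Bonferroni - divide alpha by the number of tests. Running 5 comparisons? Use alpha = 0.01 for each. This keeps the chance of any false positive across the whole set at or below 5%. Simple and conservative, often too conservative: every extra comparison tightens the threshold, so power drops quickly.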
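As a minimal sketch, Bonferroni is a one-line threshold change. The p-values below are made up for illustration, and the function name is hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Return one True/False per p-value: significant after Bonferroni?"""
    if not p_values:
        return []
    # Each comparison is held to alpha divided by the number of comparisons.
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical p-values for five variant-vs-control comparisons.
p_values = [0.003, 0.012, 0.021, 0.045, 0.30]
print(bonferroni(p_values))  # [True, False, False, False, False]
```

Four of these five p-values are below 0.05 on their own, but only one clears the corrected threshold of 0.01.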

Benjamini-Hochberg (FDR) - controls the false discovery rate (the expected proportion of “significant” findings that are false positives) rather than the family-wise error rate (the chance of any false positive at all). Less conservative than Bonferroni, and more practical when you’re running many comparisons and can live with a few false positives in exchange for more discoveries.
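The Benjamini-Hochberg step-up procedure can be sketched in a few lines: sort the p-values, find the largest rank k whose p-value is at or below (k / m) × q, and call everything up to that rank a discovery. The p-values are made up for illustration and the function name is hypothetical. The procedure's guarantee is usually stated for independent or positively dependent tests.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one True/False per p-value: a discovery at FDR level q?"""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at or below (k / m) * q.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            cutoff_rank = rank
    # Everything ranked at or below the cutoff counts as a discovery.
    discoveries = [False] * m
    for i in order[:cutoff_rank]:
        discoveries[i] = True
    return discoveries


# Hypothetical p-values for five metrics on one test.
p_values = [0.003, 0.012, 0.021, 0.045, 0.30]
print(benjamini_hochberg(p_values))  # [True, True, True, False, False]
```

On these five p-values a Bonferroni threshold of 0.05 / 5 = 0.01 would keep only the first, while Benjamini-Hochberg keeps three. In a real analysis an established implementation is preferable to hand-rolled code.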

In practice for CRO: most testing platforms don’t apply any correction by default. If you’re checking multiple metrics or segments, you need to either pick a primary metric up front and treat the rest as exploratory, or apply a correction manually.

Common mistakes:

  • Treating segment analysis like the headline result. A “significant” mobile-only effect in a 6-segment cut is much less convincing than a significant overall effect, because you’ve effectively run six tests.
  • Declaring a winner from MVT based on the variant that crossed p < 0.05 without any correction.
  • Ignoring the problem because “the platform handles it”. Most don’t.
  • Thinking Bonferroni is the only option. It’s the most famous correction, not the best for most CRO use cases.