The peeking problem

In a classical A/B test, you commit to a sample size up front, run to that sample, and check the result once. The maths assumes that single check. Every additional peek along the way gives randomness another chance to push the result across the alpha threshold, and the effective false-positive rate climbs with each look.

The intuition: in a noisy test, the running result drifts above and below the threshold by chance. If you check only once at the end, a test with no real effect has a 5% chance of crossing at alpha = 0.05. If you check every day for 14 days, you get 14 chances to cross. The looks are correlated rather than independent, because each one includes all the earlier data, but the effective alpha still ends up much higher than 0.05 - often two or three times higher in practice.
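A small simulation makes the intuition concrete. The sketch below is an illustration under simplifying assumptions, not a model of any particular platform: each day adds one unit of pure noise to a test with no real effect, the running z-statistic is checked after every day, and the team stops at the first crossing. The function name and parameters are hypothetical.

```python
import random
from statistics import NormalDist


def false_positive_rate(looks, alpha=0.05, n_sims=20000, seed=0):
    """Share of no-effect (A/A) tests declared significant at any look.

    Assumption: each day contributes an independent standard-normal
    increment, so the running z-statistic after k days is sum / sqrt(k).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for k in range(1, looks + 1):
            total += rng.gauss(0.0, 1.0)  # one more day of pure noise
            z = total / k ** 0.5          # running z-statistic
            if abs(z) > z_crit:           # "significant" at this peek
                hits += 1
                break                     # the team stops and calls a winner
    return hits / n_sims


if __name__ == "__main__":
    print("one look at the end:", false_positive_rate(1))
    print("14 daily looks:     ", false_positive_rate(14))
```

With a single look the rate should sit near the nominal 5%; with 14 daily looks it should come out well above that. The exact figures depend on the seed and on the stop-at-first-crossing assumption.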

Why the maths punishes peeking

The classical statistical machinery assumes the sample size is fixed in advance. The “5% false-positive rate” is computed on the assumption that the analyst looks once. Peeking changes the procedure - now you stop whenever you cross the threshold, which amounts to running a series of correlated tests at the same alpha and declaring a win if any one of them passes. Compounded across peeks, the family-wise false-positive rate rises.

The same effect shows up in multiple testing when you check many metrics or segments. Peeking is just multiple testing across time.
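The arithmetic for fully independent tests is easy to sketch and gives a feel for how quickly the error compounds. The function name is illustrative, and independence is an assumption: it fits separate metrics or segments reasonably well, while repeated peeks share data, so for peeking this figure is a ceiling rather than the actual rate.

```python
def independent_fwer(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k


if __name__ == "__main__":
    for k in (1, 5, 14):
        print(k, round(independent_fwer(k), 3))
```

For 14 independent tests at alpha = 0.05 this comes to about 0.51, so even a sizeable discount for correlation between looks leaves the rate far above 5%.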

What actually happens in practice

Most CRO programmes peek constantly. The platform displays the running result, the team checks it daily, and “winners” get called the moment something crosses 95% confidence. The result: a high proportion of “significant” findings are false positives that wouldn’t replicate.

Worse, peeking interacts with regression to the mean. The points where you’d be tempted to stop early are the points where the running estimate is most inflated by noise. Stopping at those moments locks in the inflated estimate as the test result, even though the underlying effect is smaller.
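The inflation can be seen in a simulation. This is a minimal sketch under assumed numbers: a true lift of 1.0 (arbitrary units), a daily noise standard deviation of 4.0, and 14 daily looks. All names and parameter values are hypothetical. It compares the estimate recorded at the first significant look with the estimate from running every test to the full 14 days.

```python
import random
from statistics import NormalDist, mean


def simulate_lift_estimates(true_lift=1.0, daily_sd=4.0, days=14,
                            n_sims=5000, seed=1):
    """Return (mean estimate at first crossing, mean estimate at full run)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)
    at_first_crossing, full_run = [], []
    for _ in range(n_sims):
        total = 0.0
        crossing_estimate = None
        for k in range(1, days + 1):
            total += rng.gauss(true_lift, daily_sd)  # that day's observed lift
            estimate = total / k                     # running estimate
            std_err = daily_sd / k ** 0.5            # its standard error
            # Record the estimate the first time the test looks like a winner.
            if crossing_estimate is None and estimate / std_err > z_crit:
                crossing_estimate = estimate
        full_run.append(total / days)
        if crossing_estimate is not None:
            at_first_crossing.append(crossing_estimate)
    return mean(at_first_crossing), mean(full_run)


if __name__ == "__main__":
    early, planned = simulate_lift_estimates()
    print("mean estimate when stopped at first crossing:", round(early, 2))
    print("mean estimate at the planned end:            ", round(planned, 2))
```

Under these assumptions the full-run average should sit close to the true lift, while the average at first crossing is forced well above it: a test can only cross the threshold on a day when its running estimate is at least about two standard errors above zero.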

How to handle it honestly

Run to planned sample, look once. The simplest fix. Pick a sample size, run to it, then check.
Use sequential testing methods that mathematically account for peeking. This works, but it requires platform support, and the stopping rules are stricter than they look.
Use Bayesian inference with proper priors. In theory, Bayesian methods carry a smaller peeking penalty, but early stopping still biases the estimated effect size upward.
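The first two options can be sketched in a few lines. The sample-size function uses the standard normal-approximation formula for comparing two conversion rates. The threshold function uses a Bonferroni split across a fixed number of planned looks, which is a deliberately crude stand-in for real group-sequential boundaries such as Pocock or O'Brien-Fleming: it is valid but more conservative than they are. Function names and example rates are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_control, p_variant, alpha=0.05, power=0.8):
    """Visitors needed per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


def bonferroni_z_threshold(planned_looks, alpha=0.05):
    """Per-look two-sided z threshold when alpha is split across looks."""
    return NormalDist().inv_cdf(1 - alpha / (2 * planned_looks))


if __name__ == "__main__":
    # Hypothetical test: 5% baseline conversion, hoping to detect 6%.
    print("visitors per arm:", sample_size_per_arm(0.05, 0.06))
    print("z threshold, 1 look:  ", round(bonferroni_z_threshold(1), 2))
    print("z threshold, 14 looks:", round(bonferroni_z_threshold(14), 2))
```

The per-look threshold for 14 planned looks is noticeably higher than the familiar 1.96, which is one way to see why sequential stopping rules feel stricter than a single end-of-test check.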

Things people get wrong

Treating “the platform supports continuous monitoring” as “peeking is fine”. Most platforms display running results but don’t apply sequential corrections.
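Stopping a test the moment it looks bad. Negative-direction peeking is the same problem as positive-direction peeking - both are optional stopping, and both inflate the error rate. A noisy early dip gets recorded as a real loss, and a variant that might have won at the planned sample is killed.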
Recalculating sample size mid-test based on the observed effect. That’s not adaptive design, that’s data dredging.
Assuming the peeking penalty is theoretical. It isn’t - empirical studies of real testing programmes find effective alpha 2-3x the nominal value when peeking is common.