Interpreting results

Most teams read test results as a binary: did the primary metric’s p-value fall below alpha or not. That’s the least informative way to read a test. A test result is a multi-dimensional signal, and reducing it to one bit throws away most of what the test told you.

A complete read of a test includes:

The point estimate and confidence interval

The point estimate (the observed effect size, like a +4% lift) is the single best guess at the true effect. The confidence interval is the range of effect sizes consistent with the data. Its width tells you how much you actually learned.

A “significant” +4% lift with a CI of [+0.5%, +7.5%] is a different result from the same +4% with a CI of [+3.7%, +4.3%]. Both pass the threshold. The first is barely informative (the true effect could be trivial or large). The second is precise.
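As a rough sketch of where the width comes from, the snippet below computes a Wald interval for the absolute difference in conversion rate between two arms. The function name `diff_ci` and all counts are made up for illustration, and the normal approximation is an assumption that only holds for reasonably large samples. Both cases have the same observed rates (10.0% vs 10.4%, a 4% relative lift), but the second has 100 times the traffic.

```python
import math
from statistics import NormalDist


def diff_ci(conv_a, n_a, conv_b, n_b, level=0.95):
    """Wald interval for the absolute difference in conversion rate (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical counts: identical rates, very different sample sizes.
small = diff_ci(5_000, 50_000, 5_200, 50_000)
large = diff_ci(500_000, 5_000_000, 520_000, 5_000_000)

for name, (diff, lo, hi) in (("small", small), ("large", large)):
    print(f"{name}: diff={diff:+.4f}  CI=[{lo:+.4f}, {hi:+.4f}]  width={hi - lo:.4f}")
```

Both intervals exclude zero, so both results are “significant”, but the interval from the larger sample is ten times narrower around the same point estimate.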

The primary metric in context

Did it cross the threshold? By how much? Is the magnitude commercially relevant (above the MDE you set during sample size calculation)?

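A statistically significant 0.3% lift on a metric where the MDE was 5% is technically a positive result and substantively a non-result. The test wasn’t powered for an effect that small, so a significant estimate at that size is more likely to be exaggerated by noise, and it sits far below the effect you decided was worth detecting.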
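One way to keep significance and relevance separate is to label a result by where its confidence interval sits relative to both zero and the MDE. The sketch below is illustrative only: `read_primary` and its labels are hypothetical, it assumes a higher metric is better, and it expects the interval and the MDE in the same units.

```python
def read_primary(ci_low, ci_high, mde):
    """Label a result by where its CI sits relative to zero and the MDE.

    Assumes higher is better and that all three values share the same units.
    """
    if ci_high < 0:
        return "significant harm"
    if ci_low > 0:
        if ci_low >= mde:
            return "significant, at or above the MDE"
        if ci_high < mde:
            return "significant, below the MDE"
        return "significant, relevance unclear"
    # From here on the interval includes zero.
    if ci_high < mde:
        return "not significant, an MDE-sized lift is unlikely"
    return "inconclusive"


# Hypothetical intervals, all judged against a 5% MDE.
print(read_primary(0.001, 0.005, mde=0.05))   # tiny but significant lift
print(read_primary(0.055, 0.080, mde=0.05))   # clearly above the MDE
print(read_primary(-0.020, 0.090, mde=0.05))  # wide interval spanning zero
```

The first call corresponds to the 0.3% case above: the interval excludes zero but lies entirely below the MDE, so it is labelled significant and below the MDE rather than a win.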

Guardrails

Did any guardrail breach? A primary metric win that comes with a 15% bounce-rate increase isn’t a clean win. The ship decision needs both the primary metric and the guardrails.
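A minimal sketch of folding guardrails into the decision, assuming each guardrail is a metric where an increase is bad and has a tolerance agreed before the test. The `Guardrail` class, the `ship_decision` function, the metric names, and the tolerances are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    name: str
    change: float        # observed relative change, e.g. 0.15 for +15%
    max_increase: float  # largest tolerated increase (assumes higher is worse)


def ship_decision(primary_win, guardrails):
    """Combine the primary result with guardrail breaches into one decision."""
    breached = [g.name for g in guardrails if g.change > g.max_increase]
    if not primary_win:
        return "no ship", breached
    if breached:
        return "hold", breached
    return "ship", breached


# Hypothetical readout: the primary metric won, but bounce rate rose 15%.
guardrails = [
    Guardrail("bounce_rate", change=0.15, max_increase=0.02),
    Guardrail("page_load_ms", change=0.01, max_increase=0.05),
]
print(ship_decision(True, guardrails))
```

This version compares point estimates only. A stricter variant would compare the upper end of each guardrail’s confidence interval against its tolerance.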

Segments

Did the effect look consistent across the main segments (mobile / desktop, new / returning, paid / organic)? Or is the headline result hiding a strong positive on one segment and a negative on another? Pre-specified segment analysis is part of interpretation. Exploratory segment fishing is not (see multiple testing).
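A rough way to check consistency across a pre-specified split is to test whether the two segment effects differ by more than their combined uncertainty. The sketch below uses a z-test on the difference between two independent estimates and a Bonferroni correction across the splits. Both are assumptions rather than the only valid choices, and every number is invented for illustration.

```python
import math
from statistics import NormalDist


def segment_gap(effect_1, se_1, effect_2, se_2):
    """Two-sided p-value for the difference between two independent segment effects."""
    z = (effect_1 - effect_2) / math.sqrt(se_1 ** 2 + se_2 ** 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical (absolute lift, standard error) pairs for each pre-specified split.
splits = {
    "mobile vs desktop": ((0.012, 0.003), (-0.004, 0.003)),
    "new vs returning": ((0.005, 0.003), (0.004, 0.003)),
    "paid vs organic": ((0.006, 0.004), (0.003, 0.003)),
}

# Bonferroni: divide alpha by the number of pre-specified comparisons.
alpha = 0.05 / len(splits)

for name, ((e1, s1), (e2, s2)) in splits.items():
    p = segment_gap(e1, s1, e2, s2)
    verdict = "segments disagree" if p < alpha else "no clear gap"
    print(f"{name}: p={p:.4f} -> {verdict}")
```

A “no clear gap” verdict is not proof that the segments agree. It only means this test did not find a difference large enough to stand out from the noise.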

Plausibility

Does the magnitude of the effect make sense given the size of the change? A 30% lift from a button colour change is almost certainly noise or a measurement artefact, even if it’s “significant”. An effect that’s much bigger than the intervention should justify is a flag, not a celebration.

Things people get wrong

Shipping a “significant” result without looking at effect size or CI. Significance alone doesn’t tell you the magnitude or the precision.
Skipping the segment breakdown because the headline looks fine. When segments are consistent with the headline you don’t gain much new information, but when they’re inconsistent you’ve found a real problem the headline buried.
Treating a non-significant result as “the change didn’t work”. The test could simply be underpowered. The honest interpretation is “we didn’t see enough evidence to call it”.
Ignoring implausibility. A 30% lift from a tiny change is much more likely to be a measurement bug than a real effect. Investigate before shipping.
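To judge whether a non-significant result might simply be underpowered, it helps to estimate the power the test had for the MDE. The sketch below uses the normal approximation for a two-sided two-proportion test and ignores the negligible opposite tail. The function name `approx_power`, the baseline rate, and the sample sizes are illustrative assumptions.

```python
import math
from statistics import NormalDist


def approx_power(base_rate, rel_lift, n_per_arm, alpha=0.05):
    """Approximate power to detect a relative lift with equal-sized arms."""
    p_a = base_rate
    p_b = base_rate * (1 + rel_lift)
    # Standard error of the difference under the assumed true rates.
    se = math.sqrt(p_a * (1 - p_a) / n_per_arm + p_b * (1 - p_b) / n_per_arm)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p_b - p_a) / se - z_crit)


# Hypothetical test: 10% baseline conversion, 5% relative MDE.
for n in (10_000, 60_000):
    print(f"n per arm = {n}: power ~ {approx_power(0.10, 0.05, n):.2f}")
```

Under these assumed numbers the smaller test has well under a 50% chance of detecting an MDE-sized effect, so a non-significant result from it says very little, while the larger one clears the conventional 80% mark.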