Statistical significance

Statistical significance is the verdict that a test result is unlikely enough under chance alone that you’ll act on it. Mechanically: a result is “significant” when its p-value is below your chosen alpha threshold. Conventionally alpha is 0.05, so p < 0.05 = significant.

It’s the gate that turns “we observed a difference between variant and control” into “we’ll ship the variant”. Below alpha, the data is surprising enough under the null hypothesis that you reject the null and treat the difference as a real effect. At or above alpha, the data could plausibly have happened by chance, so you don’t reject the null.

In a typical A/B test:

  1. Before the test, you pick alpha (usually 0.05). That’s your false-positive budget.
  2. You calculate sample size based on the effect you want to detect, the alpha you set, and the statistical power you want (commonly 80%).
  3. You run the test to the planned sample.
  4. You compute the p-value of the variant-vs-control comparison.
  5. If p is below alpha, the result is significant. If not, it isn’t.
  6. Significant in the variant’s favor → ship the variant. Otherwise → keep the control.
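
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes a conversion-rate metric, a two-sided pooled two-proportion z-test, and 80% power for the sample size calculation. The function names (`sample_size_per_arm`, `two_proportion_p_value`, `verdict`) and the example counts are hypothetical.

```python
from math import ceil, erfc, sqrt
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate visitors per arm for a two-sided two-proportion z-test.

    baseline: control conversion rate (e.g. 0.10)
    mde_abs:  smallest absolute lift worth detecting (e.g. 0.02)
    power:    assumed here; the usual convention is 0.80
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    p1, p2 = baseline, baseline + mde_abs
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for control (a) vs variant (b), pooled z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    # For a standard normal, 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return erfc(abs(z) / sqrt(2))


def verdict(p_value, alpha=0.05):
    """Binary decision rule: the threshold is fixed before the data arrives."""
    return "significant" if p_value < alpha else "not significant"


# Steps 1-2: pick alpha, then plan the sample.
n = sample_size_per_arm(baseline=0.10, mde_abs=0.02, alpha=0.05)

# Steps 4-5: made-up counts, 400/4000 for control and 480/4000 for variant.
p = two_proportion_p_value(400, 4000, 480, 4000)
print(n, round(p, 4), verdict(p))
```

The p-value is computed once, at the planned sample size, which is what keeps the false-positive budget at alpha.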

The verdict is binary even though the underlying p-value is continuous. p = 0.049 and p = 0.051 are nearly identical evidence-wise, but one crosses the threshold and one doesn’t. That’s by design - you committed to a binary decision rule before the data came in.

Significance means the data is unlikely to have occurred under the null hypothesis. It does not mean:

  • The effect is large
  • The effect is important
  • The variant is “definitely” better
  • There’s a 95% probability the variant is better

These all sound like reasonable rephrasings of significance and they’re all wrong. Significance is a property of the procedure (“we used a rule with a 5% false-positive rate”), not a probability about your specific result.
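
One way to see that the 5% belongs to the procedure is to simulate many A/A tests, where both arms share the same true conversion rate and every “significant” result is by construction a false positive. The sketch below assumes a pooled two-proportion z-test and made-up parameters (10% true rate, 400 visitors per arm, 1,000 simulated tests); `false_positive_rate` is a hypothetical helper.

```python
import random
from math import erfc, sqrt


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): no evidence either way
    z = (conv_b / n_b - conv_a / n_a) / se
    return erfc(abs(z) / sqrt(2))


def false_positive_rate(true_rate=0.10, n_per_arm=400, n_tests=1000,
                        alpha=0.05, seed=0):
    """Share of A/A tests (no real difference) that still come out significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        # Both arms are drawn from the same true rate, so the null is true.
        a = sum(rng.random() < true_rate for _ in range(n_per_arm))
        b = sum(rng.random() < true_rate for _ in range(n_per_arm))
        if p_value(a, n_per_arm, b, n_per_arm) < alpha:
            hits += 1
    return hits / n_tests


rate = false_positive_rate()
print(f"false-positive rate across A/A tests: {rate:.3f}")
```

The printed rate should land near alpha. It describes how often the rule fires when nothing is going on, and says nothing about the probability that any one significant result is real.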

The mistake most people make is treating significance as a verdict on importance. It isn’t. A statistically significant 0.1% lift on a million-visitor site might not be worth the engineering cost to ship. A non-significant 8% lift on an underpowered test might be a real effect you just couldn’t detect with the sample size you had.

These come apart constantly. Statistical significance just means “data this extreme would be unlikely if there were no real difference”. Practical significance asks “is the effect actually big enough to matter?”. The MDE (minimum detectable effect) you set during sample size calculation is the practical-significance floor - the bar you decided was worth shipping for, before you saw any data. If your “significant” result is below your MDE, the effect is likely real but trivial, and you probably shouldn’t ship it.
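
The two checks can be kept separate in code. The sketch below is illustrative only: `ship_decision` is a hypothetical helper, the lift and MDE are assumed to be in the same units (for example, relative lift), and the returned labels are one possible wording of the four cases.

```python
def ship_decision(p_value, observed_lift, mde, alpha=0.05):
    """Combine the statistical check (p vs alpha) with the practical one (lift vs MDE)."""
    statistically_significant = p_value < alpha
    practically_significant = observed_lift >= mde

    if statistically_significant and practically_significant:
        return "ship"
    if statistically_significant:
        return "significant but below MDE: probably don't ship"
    if practically_significant:
        return "large but not significant: possibly underpowered"
    return "no evidence of a worthwhile effect"


# Made-up inputs: a tiny lift on a huge sample, and a big lift on a small one.
print(ship_decision(p_value=0.01, observed_lift=0.001, mde=0.02))
print(ship_decision(p_value=0.20, observed_lift=0.08, mde=0.02))
```

Keeping the MDE as an explicit input makes it harder to quietly move the bar after the data is in.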

The Bayesian framework reframes this more usefully. Bayesian tools give you expected loss directly - “if I ship the wrong variant, here’s the cost I should expect”. That maps to a ship/don’t-ship decision more cleanly than a binary “significant / not significant” verdict.
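
As a rough sketch of what an expected-loss calculation can look like, the code below estimates it by Monte Carlo for a conversion-rate test. It assumes independent uniform Beta(1, 1) priors on each arm and measures loss as the absolute conversion rate given up by picking the worse arm. Both are modeling choices made for this example, not something a particular tool is known to use, and `expected_loss` is a hypothetical name.

```python
import random


def expected_loss(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Return (loss if you keep control, loss if you ship variant).

    Each loss is the posterior mean of the conversion rate you would give up
    if the arm you chose turned out to be the worse one.
    """
    rng = random.Random(seed)
    loss_keep_control = 0.0
    loss_ship_variant = 0.0
    for _ in range(draws):
        # Posterior under an assumed Beta(1, 1) prior: Beta(1 + successes, 1 + failures).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        loss_keep_control += max(rate_b - rate_a, 0.0)
        loss_ship_variant += max(rate_a - rate_b, 0.0)
    return loss_keep_control / draws, loss_ship_variant / draws


# Made-up counts: 400/4000 for control, 480/4000 for variant.
keep, ship = expected_loss(400, 4000, 480, 4000)
print(f"expected loss if you keep control: {keep:.5f}")
print(f"expected loss if you ship variant: {ship:.5f}")
```

A common way to use these numbers is to ship whichever option has the smaller expected loss once that loss falls below a tolerance you set in advance.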