Type I and Type II errors

Two ways a test can be wrong:

  • Type I error - false positive. The null hypothesis is true (no real effect) but you reject it anyway. You ship a variant that didn’t actually do anything, or worse, something that quietly hurt.
  • Type II error - false negative. There’s a real effect but your test fails to find it. You abandon a real win because the test said “not significant”.

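Alpha is the Type I rate you’ve agreed to accept. Beta is the Type II rate, and power is 1 − beta. For a fixed sample size and true effect, the two trade off: lowering alpha to be stricter about false positives raises beta, so you miss more real effects. The only way to push both down at once is more sample, and you can’t drive both to zero without infinite sample size.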
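To make the trade-off concrete, here is a minimal sketch that approximates power for a two-sided two-proportion z-test at several alpha levels. The function name `power_two_proportions`, the 5.0% vs 5.5% conversion rates, and the 20,000 visitors per arm are made-up illustration values, and the normal approximation with an unpooled standard error is a simplification, not the exact calculation your testing tool runs.

```python
from statistics import NormalDist


def power_two_proportions(p_control, p_variant, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    Uses the normal approximation with an unpooled standard error,
    which is a simplification that is good enough for illustration.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Standard error of the difference in conversion rates
    se = (p_control * (1 - p_control) / n_per_arm
          + p_variant * (1 - p_variant) / n_per_arm) ** 0.5
    # How many standard errors the true difference sits away from zero
    shift = abs(p_variant - p_control) / se
    # Probability the test statistic lands beyond either critical value
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Hypothetical test: 5.0% baseline, 5.5% variant, 20,000 visitors per arm
for alpha in (0.10, 0.05, 0.01):
    power = power_two_proportions(0.050, 0.055, n_per_arm=20_000, alpha=alpha)
    print(f"alpha={alpha:.2f}  power={power:.2f}  beta={1 - power:.2f}")
```

With the sample size held fixed, each stricter alpha should print a lower power and a higher beta. Raising `n_per_arm` is the only input here that improves both at once.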

In CRO, Type II errors are usually the more expensive ones because they’re invisible. A Type I error eventually shows up as “the lift we shipped didn’t replicate in revenue”. A Type II error just looks like “the test didn’t work, let’s move on” and the real win is silently abandoned.

Gelman’s reframing into Type M and Type S errors is more useful than the original I/II framing for most CRO work:

  • M-type (magnitude) - your estimate is in the right direction but way too big. Underpowered tests with a “significant” result tend to inflate the effect size, because only the samples where noise pushed the estimate well above the true effect are extreme enough to clear the threshold. You ship expecting a 10% lift and get 3%.
  • S-type (sign) - your estimate is in the wrong direction. Rare but devastating. You think the variant is better when it’s actually worse. Underpowered tests with low base rates are the usual culprit.

These matter because in CRO you don’t just want “the test was significant”. You want “the lift estimate is roughly right and pointing the right way”. Without enough power, neither can be assumed.
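A quick simulation shows how both errors grow as power drops. This is a sketch under simple assumptions: the estimated lift is normally distributed around the true lift with a known standard error, and a result counts as “significant” when it clears a two-sided z threshold. The function name `type_m_s` and the example numbers (a true absolute lift of 0.3 percentage points) are hypothetical.

```python
import random
from statistics import NormalDist, mean


def type_m_s(true_effect, se, alpha=0.05, n_sims=100_000, seed=0):
    """Simulate repeated tests and summarise only the 'significant' ones.

    Returns (power, exaggeration_ratio, sign_error_rate).
    Assumes estimates are Normal(true_effect, se); true_effect must be non-zero.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    estimates = (rng.gauss(true_effect, se) for _ in range(n_sims))
    # Keep only the runs that would have been declared significant
    significant = [e for e in estimates if abs(e) > z_crit * se]
    power = len(significant) / n_sims
    # M-type: how much bigger the significant estimates are than the truth
    exaggeration = mean(abs(e) for e in significant) / abs(true_effect)
    # S-type: share of significant estimates pointing the wrong way
    sign_errors = sum(1 for e in significant if e * true_effect < 0)
    return power, exaggeration, sign_errors / len(significant)


# Same true lift, two hypothetical levels of noise
for label, se in (("underpowered", 0.003), ("well powered", 0.001)):
    power, m_ratio, s_rate = type_m_s(true_effect=0.003, se=se)
    print(f"{label}: power={power:.2f}  "
          f"exaggeration={m_ratio:.2f}x  sign errors={s_rate:.4f}")
```

In the underpowered case the significant results should overstate the true lift by a wide margin and occasionally point the wrong way. In the well-powered case the exaggeration ratio should sit close to 1 and sign errors should all but disappear.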

Common mistakes:

  • Worrying only about Type I and ignoring Type II. The whole industry obsesses over false positives. False negatives kill just as much value; they just kill it silently.
  • Assuming a “significant” result means the magnitude is accurate. With low power, the effect size estimate is biased upward (M-type error). You overpromise the lift and underdeliver.
  • Not realising the two errors trade off. You can’t make a test “stricter” without giving up something on the other side.