Statistical power

Power is the probability your test will correctly reject the null hypothesis when there really is an effect. Beta (β) is the false-negative rate, the probability you’ll miss a real effect. Power is just 1 - β.

The convention is to aim for 80% power, which means you’re willing to miss real effects of the size you planned for 20% of the time. That’s a lot, by the way. One in five real wins gone, just like that.

Power depends on four things:

The alpha you set (lower alpha = stricter = less power)
The effect size you’re trying to detect (smaller effects need more data)
The sample size you have
The variance in your metric (noisier metrics need more data)

These trade off against each other. If you can’t increase sample size, you have to accept lower power, settle for detecting only larger effects, relax your alpha, or switch to a less noisy metric.
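To make the trade-off concrete, here is a minimal sketch of a power calculation for a conversion-rate test. It assumes a two-sided two-proportion z-test with the normal approximation and equal traffic per arm; the function name `power_two_proportions` and its parameters are illustrative, not taken from any particular calculator. For a binary conversion metric the variance is fixed by the rates themselves, so the "noise" lever shows up through the baseline rate.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_base, rel_lift, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    Assumes equal sample sizes per arm and the normal approximation.
    """
    norm = NormalDist()
    p_var = p_base * (1 + rel_lift)  # conversion rate if the lift is real
    # Standard error of the difference between the two observed rates
    se = sqrt((p_base * (1 - p_base) + p_var * (1 - p_var)) / n_per_arm)
    z_crit = norm.inv_cdf(1 - alpha / 2)  # rejection threshold for alpha
    z_effect = abs(p_var - p_base) / se   # true effect in standard errors
    # Probability of landing beyond either rejection threshold
    return norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)


# Each lever moved on its own, from a 3% baseline and a 5% relative lift
for n in (2_300, 50_000, 208_000):
    print(f"n={n:>7}  power={power_two_proportions(0.03, 0.05, n):.2f}")
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha={alpha}  power={power_two_proportions(0.03, 0.05, 50_000, alpha):.2f}")
for lift in (0.05, 0.10, 0.20):
    print(f"lift={lift:.0%}  power={power_two_proportions(0.03, lift, 50_000):.2f}")
```

Holding the other inputs fixed, power rises with sample size and effect size and falls as alpha gets stricter.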

Why most CRO programmes are silently underpowered

Most Shopify stores don’t have the traffic to run well-powered tests on realistic effect sizes. Take a store doing 10k sessions a month, testing for a 5% relative lift on a 3% baseline conversion rate at α = 0.05. It needs roughly 200,000 sessions per variant to hit 80% power. Split 50/50, that’s 5,000 sessions per arm per month, so more than three years of runtime, and that’s on a calculator that doesn’t even account for weekly seasonality. So instead they run two-week tests, see “no significant difference”, and conclude the change didn’t work. What actually happened is that the test couldn’t have detected the effect even if it were there.
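The numbers in that example can be reproduced with a short sketch. It assumes the standard normal-approximation formula for a two-sided two-proportion test and an even 50/50 traffic split; `sessions_per_arm` is a hypothetical helper name, and real calculators may differ slightly in the variance term they use.

```python
from math import ceil
from statistics import NormalDist


def sessions_per_arm(p_base, rel_lift, alpha=0.05, power=0.80):
    """Sessions needed in each arm to detect a relative lift (normal approximation)."""
    norm = NormalDist()
    p_var = p_base * (1 + rel_lift)
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = norm.inv_cdf(power)           # quantile for the desired power
    variance_sum = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p_var - p_base) ** 2)


# The store from the example: 3% baseline, 5% relative lift, 10k sessions a month
n = sessions_per_arm(0.03, 0.05)
monthly_sessions = 10_000
months = n / (monthly_sessions / 2)  # assumes an even split across two arms
print(f"{n:,} sessions per arm, about {months:.0f} months of traffic")
```

Under these assumptions the result lands a little above 200,000 sessions per arm, which at 5,000 sessions per arm per month is around 40 months.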

I usually run the sample size calc before the test brief is even written. If the answer is “more than a quarter”, we either pick a bigger swing (layouts and offers, not button colours), move the test up the funnel where there’s more volume, or accept we’re running directional tests and lean harder on qualitative research.
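One way to frame the “bigger swing” decision is to turn the calculation around: fix the runtime you can afford and ask what lift the test could detect. The sketch below does that by bisection on a normal-approximation power function. The names `approx_power` and `min_detectable_lift` are illustrative, and it assumes a two-sided test, two arms, and an even traffic split.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p_base, rel_lift, n_per_arm, alpha=0.05):
    """Normal-approximation power of a two-sided two-proportion z-test."""
    norm = NormalDist()
    p_var = p_base * (1 + rel_lift)
    se = sqrt((p_base * (1 - p_base) + p_var * (1 - p_var)) / n_per_arm)
    z_crit = norm.inv_cdf(1 - alpha / 2)
    z_effect = abs(p_var - p_base) / se
    return norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)


def min_detectable_lift(p_base, n_per_arm, alpha=0.05, target_power=0.80):
    """Smallest relative lift detectable at the target power, found by bisection."""
    lo, hi = 0.0, 1.0 / p_base - 1.0  # upper bound keeps the variant rate <= 100%
    for _ in range(60):
        mid = (lo + hi) / 2
        if approx_power(p_base, mid, n_per_arm, alpha) < target_power:
            lo = mid  # not enough power yet, so the lift must be larger
        else:
            hi = mid
    return hi


# One quarter of traffic for the 10k-sessions store, split across two arms
n_quarter = 3 * 10_000 // 2
mde = min_detectable_lift(0.03, n_quarter)
print(f"Detectable relative lift in one quarter: {mde:.1%}")
```

If the detectable lift is far larger than anything the proposed change could plausibly deliver, that is the signal to pick a bolder test, move up the funnel, or treat the result as directional.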