Sample size
The pre-test maths that tells you how many visitors you need before you have a reasonable shot at detecting an effect. Any decent A/B testing tool has a calculator built in. Feed it your baseline rate, your minimum detectable effect, your alpha, and your power, and it spits out the required sample per variant.
The four inputs:
- Baseline conversion rate - what your control currently does. The lower the baseline, the bigger the sample you need.
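- MDE (Minimum Detectable Effect) - the smallest lift you actually care about detecting. This is the input people most often get wrong.
- Alpha - the false-positive rate you’ll accept, i.e. the chance of calling a winner when there’s no real difference. Usually 0.05.
- Power - the chance of detecting a true effect at least as big as the MDE. Usually 0.8.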
The MDE is the load-bearing assumption. It’s not “the effect you expect”, it’s the smallest effect that’s worth detecting. Set MDE too low (say 1% relative) and you get huge required samples that smaller sites can’t reach. Set it too high (say 20% relative) and a real but smaller win will probably look “non-significant” and you’ll incorrectly conclude the change didn’t work.
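If you want to see what the calculator is doing, here is a minimal sketch using the standard normal-approximation formula for a two-sided test on two proportions. It assumes an even split between variants and a relative MDE; the function name and the example inputs are hypothetical, and your tool may use a slightly different variance term.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.8):
    """Visitors needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # the rate you want to be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for power = 0.8
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical inputs: 5% baseline, 10% relative MDE, default alpha and power.
    print(sample_size_per_variant(0.05, 0.10))
```

For those inputs this comes out at roughly 31,000 per variant. Halving the MDE roughly quadruples the requirement, which is why the MDE matters more than any other input.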
In practice on Shopify
For a typical mid-size Shopify store doing 50k sessions a month, a test on add-to-cart rate with a 2% baseline, MDE of 10% relative, alpha 0.05 (two-sided), power 0.8, needs around 80,000 sessions per variant. That’s about 160,000 sessions in total, or more than three months of all your traffic going through the test, so even a two-variant test doesn’t fit. At a 5% baseline the same inputs need around 31,000 per variant, which is a little over five weeks of traffic. There, two-variant tests fit. Three-variant MVTs don’t.
The honest output of a sample size calc is often “you can’t actually run this test in a reasonable time”. That’s useful information. It means you either pick bigger swings or accept that you’re running directional tests, not statistical ones.
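One way to find out how big a swing you’d need is to run the calculation backwards: fix the sample you can realistically collect and solve for the smallest lift it can detect. A rough sketch, assuming the standard two-proportion approximation and a baseline below 50%; the function names and example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.8):
    """Two-sided two-proportion approximation, even split between variants."""
    p2 = baseline * (1 + relative_mde)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = baseline * (1 - baseline) + p2 * (1 - p2)
    return math.ceil(z ** 2 * variance / (p2 - baseline) ** 2)


def smallest_detectable_lift(baseline, n_per_variant, alpha=0.05, power=0.8):
    """Bisect for the smallest relative lift that n_per_variant can detect.

    Assumes baseline < 0.5 so the search can cap at a 100% relative lift.
    Returns None if even a 100% lift needs more than n_per_variant.
    """
    low, high = 1e-4, 1.0
    if sample_size_per_variant(baseline, high, alpha, power) > n_per_variant:
        return None
    for _ in range(60):
        mid = (low + high) / 2
        if sample_size_per_variant(baseline, mid, alpha, power) > n_per_variant:
            low = mid    # lift too small for this sample, search higher
        else:
            high = mid   # this lift fits, try a smaller one
    return high


if __name__ == "__main__":
    # Hypothetical store: 5% baseline, 10,000 sessions per variant available.
    print(smallest_detectable_lift(0.05, 10_000))
```

With a 5% baseline and 10,000 sessions per variant, the answer lands at roughly an 18% relative lift. If nothing on the roadmap could plausibly move the metric that much, the test is directional at best.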
My rule of thumb: if the calc returns more than six weeks, the test gets rescoped or shelved. Six weeks is where seasonality, traffic shifts, and stakeholder patience start to bite, and a test that runs longer than that is fighting more than the null.
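The six-week cut-off is easy to turn into a check. A minimal sketch, assuming flat traffic and an even split between variants; `weeks_to_run`, `verdict` and the numbers in the example are made up for illustration.

```python
def weeks_to_run(n_per_variant, variants, monthly_sessions, traffic_share=1.0):
    """Weeks needed to collect the planned sample across all variants.

    traffic_share is the fraction of sessions that enter the test (1.0 = all).
    """
    weekly_sessions = monthly_sessions * 12 / 52 * traffic_share
    return n_per_variant * variants / weekly_sessions


def verdict(weeks, max_weeks=6):
    """Apply the rule of thumb: anything past max_weeks gets rescoped or shelved."""
    return "run" if weeks <= max_weeks else "rescope or shelve"


if __name__ == "__main__":
    # Hypothetical plan: 20,000 sessions per variant, 40,000 sessions a month.
    for variants in (2, 3):
        weeks = weeks_to_run(20_000, variants, 40_000)
        print(variants, "variants:", round(weeks, 1), "weeks,", verdict(weeks))
```

In this made-up plan the two-variant test takes a bit over four weeks and passes, while adding a third variant pushes it to six and a half weeks and fails the rule.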
Things people get wrong
- Skipping it entirely. Most CRO programmes just run tests until they feel ready, which usually means “until they look significant”. Peeking, in other words.
- Setting MDE based on what looks achievable rather than what’s worth detecting. The MDE should be a business decision: what’s the smallest lift that’s worth shipping?
- Forgetting that the calc assumes you keep the test running until the planned sample is hit. If you stop early when it looks good, your effective alpha is much higher than the nominal value (see sequential testing for the legitimate version of stopping early).
- Recalculating sample size mid-test based on the observed effect. That’s not a sample size calc, that’s data dredging.
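The cost of stopping early is easiest to see in a simulation. The sketch below runs A/A tests (no real difference between arms) and compares two policies: judging only at the planned sample, versus stopping at the first of several interim looks that crosses p < 0.05. The conversion rate, sample size, number of looks and function names are all arbitrary assumptions chosen to keep the run short.

```python
import math
import random


def z_stat(conv_a, conv_b, n):
    """Two-proportion z statistic for equal-sized arms of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se


def false_positive_rates(rate=0.10, n_per_arm=1000, looks=10, sims=1000, seed=1):
    """Simulate A/A tests; return (rate if you wait, rate if you peek)."""
    rng = random.Random(seed)
    step = n_per_arm // looks
    z_crit = 1.96  # two-sided alpha = 0.05
    wait_hits = peek_hits = 0
    for _ in range(sims):
        conv_a = conv_b = 0
        crossed = False
        for look in range(1, looks + 1):
            # Both arms convert at the same true rate, so any "win" is noise.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            significant = abs(z_stat(conv_a, conv_b, look * step)) > z_crit
            crossed = crossed or significant
        wait_hits += significant  # only the final, planned look counts
        peek_hits += crossed      # any look that crossed the line counts
    return wait_hits / sims, peek_hits / sims


if __name__ == "__main__":
    wait, peek = false_positive_rates()
    print("wait for planned sample:", wait)
    print("stop at first significant look:", peek)
```

The waiting policy should land near the nominal 5%, while the peeking policy typically comes out several times higher, even though nothing about the variants differs.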