Alpha levels

Alpha (α) is the threshold below which a p-value is considered surprising enough to reject the null hypothesis. It’s the false-positive rate you’ve agreed to accept. Set alpha to 0.05 and you’re saying “I’m willing to be wrong 5% of the time when there’s actually no effect”.

If you ran 100 A/B tests where there was actually no effect, around 5 of them would still come back significant by chance at α = 0.05. That’s the price of using a threshold at all. You can lower alpha to be more conservative (0.01, 0.001) but you’ll need more sample size to detect real effects, because you’re demanding stronger evidence.

0.05 is an arbitrary number (picked in the 1920s). For a low-stakes test, it’s fine. For a checkout redesign that’s expensive to ship and risky to roll back, 0.01 might be safer. For an exploratory test where you’d rather miss real wins than chase false ones, you might even relax to 0.1. Alpha is a business decision not a statistical one.

One-tailed vs two-tailed alpha

If you run a one-tailed test at α = 0.05, you’ve effectively concentrated all your “false positive budget” on one direction - you’re only checking for an effect in the direction you predicted. A two-tailed test at 0.05 splits 2.5% to each side, so a sufficiently extreme result in either direction crosses the threshold. Most tools default to two-tailed, which is the safer call (see hypothesis formulation for why).