Frequentist vs Bayesian

Two ways of doing the same thing - deciding whether your test result is real - that answer subtly different questions and produce different-feeling outputs.

Frequentist is the classical approach. You assume a null hypothesis (there’s no effect), then ask “given that, how likely is data at least as extreme as what I just observed?”. If the data is sufficiently unlikely under the null, you reject it. Output: a p-value, a confidence interval, and a binary significant / not significant verdict. This is what tools like Optimizely Classic, Google Optimize (RIP), and most enterprise platforms used to give you.
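As a rough sketch of what that frequentist read looks like, here is a two-proportion z-test using only the Python standard library. The function name and the visitor and conversion counts are made up for illustration, and real platforms may use different tests or corrections.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test for the difference in conversion rate (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Under the null both arms share one rate, so pool them for the test statistic.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_null = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # The confidence interval uses the unpooled standard error.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)

    return {"diff": diff, "p_value": p_value, "ci": ci, "significant": p_value < alpha}


# Hypothetical test: 5,000 visitors per arm, 200 vs 245 conversions.
result = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=245, n_b=5000)
print(result)
```

Note what the output does not contain: any statement about how likely it is that B is better. The p-value only describes the data under the assumption of no effect.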

Bayesian flips the question. You start with a prior belief about the effect size (often a weak, uninformative prior), update it with the data, and end up with a posterior distribution over possible effects. Output: a probability that B beats A directly (“there’s an 87% chance the variant is better”), expected loss if you pick wrong, and credible intervals (the Bayesian analogue of confidence intervals, but they actually mean what people intuitively think CIs mean).
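A minimal sketch of the Bayesian version, assuming a Beta-Binomial model with a flat Beta(1, 1) prior and Monte Carlo draws from the posterior. The function names and counts are illustrative assumptions; this is not how any particular vendor implements it.

```python
import random


def beta_posterior(conversions, visitors, prior_alpha=1.0, prior_beta=1.0):
    """Beta prior plus binomial data gives a Beta posterior (conjugate update)."""
    return prior_alpha + conversions, prior_beta + visitors - conversions


def summarize(conv_a, n_a, conv_b, n_b, draws=50_000, seed=0):
    rng = random.Random(seed)
    a_alpha, a_beta = beta_posterior(conv_a, n_a)
    b_alpha, b_beta = beta_posterior(conv_b, n_b)

    # Sample the difference in conversion rate (B minus A) from the two posteriors.
    diffs = sorted(
        rng.betavariate(b_alpha, b_beta) - rng.betavariate(a_alpha, a_beta)
        for _ in range(draws)
    )

    prob_b_better = sum(d > 0 for d in diffs) / draws
    # 95% credible interval: the middle 95% of the sampled differences.
    credible = (diffs[int(0.025 * draws)], diffs[int(0.975 * draws)])
    return prob_b_better, credible


# Hypothetical test: 5,000 visitors per arm, 200 vs 245 conversions.
prob_b_better, credible = summarize(200, 5000, 245, 5000)
print(f"P(B beats A) = {prob_b_better:.3f}, 95% credible interval = {credible}")
```

The probability here is read directly off the posterior samples, which is why the output can be phrased as “chance the variant is better”.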

The difference isn’t really mathematical; it’s about which question you’re answering. Frequentist asks “how surprising would this data be if the null were true?”. Bayesian asks “given the data, what should I believe about the effect?”. The second one is what business stakeholders actually care about.

VWO, Convert, Statsig, AB Tasty - many modern testing tools either default to Bayesian or offer it as an option. There are a few reasons:

  • The output is more intuitive. “87% chance variant wins” lands better than “p = 0.04” with a sales team.
  • It handles the peeking problem more gracefully. Bayesian methods don’t have the same false-positive inflation when you check results frequently (though they’re not immune to it: early stopping still biases your estimate toward extremes).
  • Expected loss is a directly actionable metric. Frequentist gives you significance; Bayesian gives you “if I ship the wrong variant, here’s the cost I should expect”. That maps cleanly to ship/don’t-ship decisions.
  • It doesn’t pretend you have no prior belief. You always do.
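To make the expected loss idea concrete, here is a hedged sketch that estimates it by sampling from flat-prior Beta posteriors. Loss is measured in absolute conversion rate; the function name, the counts, and the shipping threshold are all assumptions for illustration.

```python
import random


def expected_loss(conv_a, n_a, conv_b, n_b, draws=50_000, seed=0):
    """Return (expected loss if we ship A, expected loss if we ship B)."""
    rng = random.Random(seed)
    loss_if_ship_a = 0.0
    loss_if_ship_b = 0.0
    for _ in range(draws):
        # Flat Beta(1, 1) prior updated with the observed counts.
        theta_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        theta_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        # Shipping a variant only costs you in draws where the other one is better.
        loss_if_ship_a += max(theta_b - theta_a, 0.0)
        loss_if_ship_b += max(theta_a - theta_b, 0.0)
    return loss_if_ship_a / draws, loss_if_ship_b / draws


# Hypothetical test: 5,000 visitors per arm, 200 vs 245 conversions.
loss_a, loss_b = expected_loss(200, 5000, 245, 5000)

# Hypothetical rule: ship B once its expected loss drops below 0.05 percentage points.
LOSS_THRESHOLD = 0.0005
ship_b = loss_b < LOSS_THRESHOLD
print(f"loss if ship A = {loss_a:.5f}, loss if ship B = {loss_b:.5f}, ship B: {ship_b}")
```

The threshold is a business choice rather than a statistical constant: it is the amount of conversion rate you are willing to risk losing by being wrong.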

The Frequentist defence is that priors are subjective and you can game outcomes by choosing them. In practice most Bayesian CRO tools use weakly informative or uniform priors, which are basically neutral, so this is more of a theoretical concern than a practical one.

Frequentist is still the better fit in a few places: academic publishing, regulatory submissions, and anywhere the audit trail and reproducibility matter more than the result being easy to read. Same goes for very small samples - Bayesian leans on whatever starting assumption you feed it, and when there isn’t much data that assumption ends up doing more work than the data does. Frequentist sidesteps that.
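The small-sample point is easy to see with a bit of arithmetic. In this sketch the posterior mean of a Beta-Binomial model is computed under two priors; the “sceptical” Beta(3, 97) prior and the visitor counts are hypothetical values picked for illustration.

```python
def posterior_mean(conversions, visitors, prior_alpha, prior_beta):
    """Posterior mean conversion rate under a Beta(prior_alpha, prior_beta) prior."""
    return (prior_alpha + conversions) / (prior_alpha + prior_beta + visitors)


# Two hypothetical priors: flat, and one that expects roughly a 3% conversion rate
# with the weight of about 100 visitors.
FLAT = (1, 1)
SCEPTICAL = (3, 97)

# Same observed rate (10%), very different amounts of data.
small_sample = (4, 40)
large_sample = (400, 4000)

gap_small = posterior_mean(*small_sample, *FLAT) - posterior_mean(*small_sample, *SCEPTICAL)
gap_large = posterior_mean(*large_sample, *FLAT) - posterior_mean(*large_sample, *SCEPTICAL)

print(f"prior-driven gap with 40 visitors:   {gap_small:.4f}")
print(f"prior-driven gap with 4000 visitors: {gap_large:.4f}")
```

With 40 visitors the two priors disagree by several percentage points; with 4,000 visitors the disagreement shrinks to a fraction of a point because the data outweighs the prior.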

For Shopify CRO specifically? Bayesian almost every time. Faster reads, more intuitive outputs, and the maths handles the realities of running tests in a low-volume, high-variance environment better. The cost is that you have to be a little more careful explaining what “87% chance to win” actually means. It’s not the same as “87% certain about the size of the lift”, and it’s not a guarantee the lift will replicate.

What I actually do: default to whatever the platform gives me (usually Bayesian) and cross-check anything close to a ship decision with a frequentist read. If the two strongly disagree, something else is off - usually low power, a bad split, or a metric that doesn’t behave like the model assumes.

Common mistakes:

  • Thinking the two approaches will give you contradictory answers. They usually don’t. With a flat prior and a large sample, Bayesian posteriors and Frequentist intervals converge.
  • Treating “97% chance to win” as “definitely going to win”. That’s still a 3% chance you’re wrong, and the lift estimate has its own uncertainty separate from the win probability.
  • Assuming switching frameworks fixes a broken test. If your sample size is too small, your metric is noisy, or your assignment is biased, neither approach will save you.
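The convergence claim in the first point can be checked numerically. This is a sketch under stated assumptions: a flat Beta(1, 1) prior, a standard unpooled (Wald) confidence interval, Monte Carlo sampling for the credible interval, and invented counts of 20,000 visitors per arm.

```python
import random
from math import sqrt
from statistics import NormalDist


def wald_interval(conv_a, n_a, conv_b, n_b, level=0.95):
    """Frequentist confidence interval for the difference in rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


def credible_interval(conv_a, n_a, conv_b, n_b, level=0.95, draws=50_000, seed=0):
    """Bayesian credible interval for the same difference, flat Beta(1, 1) priors."""
    rng = random.Random(seed)
    diffs = sorted(
        rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        - rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        for _ in range(draws)
    )
    tail = (1 - level) / 2
    return diffs[int(tail * draws)], diffs[int((1 - tail) * draws) - 1]


# Hypothetical large test: 20,000 visitors per arm, 1,200 vs 1,290 conversions.
ci = wald_interval(1200, 20000, 1290, 20000)
cred = credible_interval(1200, 20000, 1290, 20000)
print("confidence interval:", ci)
print("credible interval:  ", cred)
```

The two intervals should land within sampling noise of each other here. They start to drift apart when the sample is small or the prior is informative, which is the situation where a strong disagreement tells you something.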