Hypothesis formulation

A test without a hypothesis is just a guess with extra steps. The hypothesis is what you’re claiming and how you’d know if you’re wrong - everything else (sample size, primary metric, duration) flows out of it.

The classical setup has two hypotheses:

  • Null (H0) - the boring one. There’s no difference between control and treatment. The change you’re testing did nothing.
  • Alternative (H1) - the interesting one. There is a difference. The change did something.

You don’t “prove” H1. You collect enough evidence to reject H0, which is a different thing. The distinction matters: a “non-significant” test isn’t a negative result, it’s an inconclusive one. Failing to reject H0 doesn’t mean H0 is true; it means you haven’t seen enough evidence to dismiss it. That’s easy to forget when stakeholders want a clean yes/no.
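To make the “inconclusive, not negative” point concrete, here’s a minimal sketch of a two-tailed two-proportion z-test using only Python’s standard library. The function name and the visitor/conversion counts are made up for illustration, and it assumes the normal approximation is adequate at these sample sizes.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_c, n_c, conv_t, n_t):
    """Two-tailed z-test for a difference in conversion rates.

    Returns (diff, p_value, (ci_low, ci_high)), where diff is the treatment
    rate minus the control rate and the interval is a 95% confidence interval.
    """
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c

    # Pooled standard error: computed as if H0 (no difference) were true.
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the interval around the observed difference.
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    margin = NormalDist().inv_cdf(0.975) * se
    return diff, p_value, (diff - margin, diff + margin)


# Hypothetical data: 100/2000 conversions in control, 115/2000 in treatment.
diff, p_value, (low, high) = two_proportion_ztest(100, 2000, 115, 2000)
print(f"diff={diff:.4f}  p={p_value:.3f}  95% CI=({low:.4f}, {high:.4f})")
```

With these made-up numbers the p-value sits well above 0.05, but the interval covers both zero and a lift of around two percentage points. The data can’t rule out “no effect”, and it can’t rule out a sizeable win either.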

A non-directional (two-tailed) hypothesis just says “different” - the new design will change the conversion rate, either up or down. A directional (one-tailed) hypothesis specifies which way - the new design will increase conversion. Two-tailed is the safe default and what most tools assume. One-tailed gives you a bit more statistical power for the same sample size, at the cost of being blind to the opposite direction. If your one-tailed bet was wrong and the variant actually tanked conversion, the test won’t flag it: a drop just shows up as “not significant”, the same as no effect at all.

In practice almost everyone should run two-tailed. Even if you think you know which way the change will go, “it tanked” is a result you want to detect.
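A rough sketch of that blind spot, again using only the standard library. The helper names and the counts are hypothetical; the one-tailed test here is for H1 “treatment is higher than control”.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(conv_c, n_c, conv_t, n_t):
    """Pooled two-proportion z statistic (treatment minus control)."""
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    return (conv_t / n_t - conv_c / n_c) / se


def p_values(z):
    """Return (two_tailed, one_tailed_increase) p-values for a z statistic."""
    norm = NormalDist()
    two_tailed = 2 * (1 - norm.cdf(abs(z)))
    one_tailed_increase = 1 - norm.cdf(z)  # H1: treatment > control
    return two_tailed, one_tailed_increase


# Hypothetical scenario A: treatment wins, 5.0% -> 6.0%.
z_up = z_statistic(500, 10_000, 600, 10_000)
# Hypothetical scenario B: treatment loses by the same amount, 6.0% -> 5.0%.
z_down = z_statistic(600, 10_000, 500, 10_000)

for label, z in [("treatment up", z_up), ("treatment down", z_down)]:
    two, one = p_values(z)
    print(f"{label}: two-tailed p={two:.4f}, one-tailed p={one:.4f}")
```

When the treatment wins, the one-tailed p-value is half the two-tailed one, which is where the extra power comes from. When it loses by the same margin, the two-tailed test still flags the difference, while the one-tailed p-value lands close to 1 and looks like a non-result.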

What a good hypothesis actually looks like

This is the statistical framing - the CRO-flavoured version (testability, mechanism, three-part structure) is in strong hypotheses. A real hypothesis isn’t “let’s change the button colour”. It has three parts:

  1. The change you’re making
  2. The mechanism you think it’ll trigger (the why - usually a behavioural or UX claim)
  3. The metric you expect to move, and roughly by how much

So: “Replacing the generic CTA copy with benefit-led copy on the PDP will increase add-to-cart rate by 5%+, because users currently bounce when the value isn’t clear above the fold.”

If your hypothesis can’t be falsified by the test, it’s not really a hypothesis, it’s a project. “Improve the checkout” isn’t testable. “Removing the discount code field on step 1 of checkout will increase completion rate by 3%+, because the field draws people off-site to hunt for codes” is.

Common mistakes

  • Conflating hypothesis with goal. “Make the site convert better” is a goal. The hypothesis is the specific claim about what change moves the metric and why.
  • Treating a non-significant result as proof the change didn’t work. It might just mean you didn’t have the sample size to detect the effect.
  • Hypothesising after results are in (HARKing). You look at the data, find the segment where it “worked”, and write a hypothesis around it post hoc. This is how you get fake wins that don’t replicate.
  • Being vague about the expected effect size. “Lift conversion” gives you no way to do sample size maths. You need a number, even if it’s a guess.
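To show why the effect size has to be a number, here’s a minimal sketch of the standard normal-approximation sample size formula for comparing two proportions. The function name, the 10% baseline rate and the choice to treat the lift as relative are all assumptions for illustration; alpha = 0.05 and 80% power are common defaults, not something this page prescribes.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-tailed test.

    baseline:      control conversion rate, e.g. 0.10 for 10%
    relative_lift: expected relative change, e.g. 0.05 for +5%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical inputs: 10% baseline, expecting a +5% or a +10% relative lift.
n_small = sample_size_per_arm(baseline=0.10, relative_lift=0.05)
n_large = sample_size_per_arm(baseline=0.10, relative_lift=0.10)
print(f"+5% lift:  {n_small} per arm")
print(f"+10% lift: {n_large} per arm")
```

The required sample scales with one over the squared difference, so halving the expected lift roughly quadruples the traffic you need. Without a number for the lift there’s nothing to plug in.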