Strong hypotheses

The statistical version of a hypothesis (see hypothesis formulation) is about H0 and H1 - the null you’re trying to reject and the alternative you’re claiming. That’s the framework for analysis. The CRO version is the framework for test design: what makes the proposed test worth running and the result interpretable.

A strong CRO hypothesis has three parts:

  1. The change - exactly what you’re varying in the variant.
  2. The mechanism - why you expect the change to affect the metric. The behavioural or UX claim behind it.
  3. The expected effect - what metric will move, in what direction, by roughly how much.

A weak hypothesis is missing one or more of these. “Change the CTA copy” is a change without a mechanism or effect. “Improve the checkout” isn’t a change at all - it’s a project, with no testable claim.

A strong version: replacing the generic CTA copy on the PDP with benefit-led copy will increase add-to-cart rate by at least 5% (relative), because cold visitors currently bounce when the value isn’t clear above the fold.

Each part is doing work:

  • Change: specific (CTA copy from X to Y, on the PDP)
  • Mechanism: specific (cold visitors don’t see the value above the fold)
  • Effect: specific (add-to-cart, +5% relative)
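
One way to make the three-part structure concrete is to record each hypothesis as a small structured entry and check which parts are missing. The sketch below is illustrative only: the `Hypothesis` class, its field names, and `missing_parts` are assumptions made for this example, not a standard schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hypothesis:
    """One backlog entry. Field names are hypothetical, chosen for this sketch."""
    change: Optional[str] = None                 # what varies in the variant
    mechanism: Optional[str] = None              # why the change should move the metric
    metric: Optional[str] = None                 # which metric is expected to move
    direction: Optional[str] = None              # "increase" or "decrease"
    min_relative_effect: Optional[float] = None  # e.g. 0.05 for +5% relative

    def missing_parts(self) -> List[str]:
        """Return which of the three parts are absent or incomplete."""
        missing = []
        if not self.change:
            missing.append("change")
        if not self.mechanism:
            missing.append("mechanism")
        # The effect needs a metric, a direction, and a rough size.
        if not (self.metric and self.direction and self.min_relative_effect):
            missing.append("expected effect")
        return missing


strong = Hypothesis(
    change="Replace generic CTA copy on the PDP with benefit-led copy",
    mechanism="Cold visitors bounce when the value isn't clear above the fold",
    metric="add-to-cart rate",
    direction="increase",
    min_relative_effect=0.05,
)
weak = Hypothesis(change="Change the CTA copy")

print(strong.missing_parts())  # []
print(weak.missing_parts())    # ['mechanism', 'expected effect']
```

A check like this only confirms that each part was written down. Whether the mechanism is plausible still needs a human read.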

The mechanism is what most teams skip. Without it, the test is a guess - if the result is positive you don’t know why, and if it’s negative you don’t know what to try next. The mechanism is what makes the test a learning opportunity rather than a coin flip.

The hypothesis has to be able to fail. “The new design will perform better” can’t fail - with no named metric, direction, or size, no result confirms or rejects anything specific. The strong version can fail in specific ways, and each failure mode teaches you something:

  • Add-to-cart doesn’t move → the value-above-the-fold mechanism wasn’t the bottleneck
  • Add-to-cart moves but checkout doesn’t → the remaining friction is further down the funnel
  • Add-to-cart drops → the new copy is worse than the old
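
The failure modes above can be written down before the test runs, so the reading of the result is fixed in advance. This is a minimal sketch under stated assumptions: `read_result` is a hypothetical helper, the "up" / "flat" / "down" labels are assumed to come from whatever significance test the team already uses, and the "supported" branch is added here for completeness.

```python
def read_result(add_to_cart: str, checkout: str) -> str:
    """Map observed movement in the two metrics to the lesson it implies.

    Each argument is "up", "flat", or "down", decided upstream by the
    team's own significance test (not shown here).
    """
    allowed = {"up", "flat", "down"}
    if add_to_cart not in allowed or checkout not in allowed:
        raise ValueError("movement must be 'up', 'flat', or 'down'")

    if add_to_cart == "down":
        return "the new copy is worse than the old"
    if add_to_cart == "flat":
        return "the value-above-the-fold mechanism wasn't the bottleneck"
    # From here on, add-to-cart moved up.
    if checkout == "up":
        return "hypothesis supported: the effect carries through to checkout"
    return "the friction is somewhere else in the funnel"


print(read_result("flat", "flat"))
print(read_result("up", "flat"))
print(read_result("down", "flat"))
```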

A non-falsifiable hypothesis can’t fail and so can’t teach. Tests on non-falsifiable hypotheses always look like wins or “inconclusive”, which teaches you as much as never having run them.

Strong hypotheses are the input to ICE / PIE / RICE prioritisation. A backlog of hypotheses that each have the three-part structure can be ranked meaningfully. A backlog of “let’s test this” entries can’t.

The prioritisation pass also acts as a quality filter. Weak hypotheses get sent back for sharpening before they ever reach the test queue. A team that runs prioritisation seriously rarely has bad tests, because the bad ones are caught upstream.
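
As a rough illustration of prioritisation doubling as a quality filter, the sketch below sends incomplete entries back and ranks the rest by a RICE score, computed here as reach × impact × confidence ÷ effort. The backlog entries, their scores, and the names `BacklogEntry`, `rice`, and `triage` are all made up for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BacklogEntry:
    name: str
    has_mechanism: bool    # is the "why" written down?
    has_effect_size: bool  # is there a metric, direction, and rough size?
    reach: float           # visitors exposed per month (hypothetical numbers)
    impact: float          # team's impact rating
    confidence: float      # 0..1
    effort: float          # person-weeks


def rice(entry: BacklogEntry) -> float:
    """RICE score: reach * impact * confidence / effort."""
    return entry.reach * entry.impact * entry.confidence / entry.effort


def triage(backlog: List[BacklogEntry]) -> Tuple[List[BacklogEntry], List[BacklogEntry]]:
    """Split the backlog into ranked, test-ready entries and entries sent back."""
    ready = [e for e in backlog if e.has_mechanism and e.has_effect_size]
    sent_back = [e for e in backlog if not (e.has_mechanism and e.has_effect_size)]
    ready.sort(key=rice, reverse=True)  # highest score first
    return ready, sent_back


backlog = [
    BacklogEntry("Guest checkout option", True, True, 10_000, 2, 0.5, 4),
    BacklogEntry("Let's test a new hero image", False, False, 50_000, 1, 0.5, 1),
    BacklogEntry("Benefit-led CTA copy on PDP", True, True, 40_000, 1, 0.8, 1),
]

ready, sent_back = triage(backlog)
print([e.name for e in ready])      # ranked test queue
print([e.name for e in sent_back])  # needs sharpening first
```

The hero-image entry would score well on reach alone, which is why the filter runs before the ranking in this sketch: an entry with no mechanism never gets a score to compete with.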

Common ways a hypothesis goes wrong:

  • The change without the mechanism. “Move the CTA above the fold.” Tests fine, teaches nothing about why.
  • The metric without the effect size. “Lift conversion” with no minimum detectable effect (MDE). Sample-size calculation has nothing to work with.
  • Hypothesising after the data. HARKing (hypothesising after the results are known) - looking at the result and writing a hypothesis that matches it after the fact. Common in segment analysis.
  • The unfalsifiable hypothesis. “Users will find this more intuitive.” How would you know? The hypothesis needs an observable consequence to be testable.
  • Testing taste, not behaviour. “The team prefers this design.” Not a hypothesis. Test it on users if you want to know, but don’t claim it’s a CRO test.
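
To show why a missing MDE stalls the sample-size calculation, here is a minimal sketch using the usual normal-approximation formula for comparing two proportions. The 10% baseline add-to-cart rate, the 5% significance level, and the 80% power are assumptions chosen for illustration, and `sample_size_per_arm` is a hypothetical helper, not part of any library.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-sided two-proportion test."""
    if relative_mde is None or relative_mde <= 0:
        # "Lift conversion" with no size gives the formula nothing to divide by.
        raise ValueError("a positive MDE is required to size the test")

    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)        # rate if the effect is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# Assumed 10% baseline and the +5% relative effect from the example hypothesis.
n = sample_size_per_arm(baseline_rate=0.10, relative_mde=0.05)
print(n)
```

Under these assumptions the requirement is roughly 58,000 visitors per arm. Because the difference between the two rates is squared in the denominator, halving the MDE roughly quadruples the sample size, so the effect size is worth stating up front.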