Building an experimentation programme

A programme is what you build when you want experimentation to outlast the people who set it up. A few one-off A/B tests aren’t a programme. A pattern of recurring tests, shared learnings, and infrastructure that supports both is.

The components, roughly in order of importance:

Without the right tooling, every test costs an order of magnitude more than it should. The minimum:

  • A testing platform that handles randomisation, assignment consistency, and basic statistical analysis correctly. Many “platforms” don’t.
  • Analytics that can join experiment data with downstream metrics (LTV, retention, support tickets).
  • A way to surface running tests to the rest of the org so nobody accidentally launches conflicting tests on the same surface.
  • Documented pre-registration and analysis templates.
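
As a rough illustration of what "assignment consistency" means in the first bullet, one common approach is to hash the user and experiment identifiers so that the same user always lands in the same variant without any stored state. The sketch below is a minimal version under assumptions: the function name `assign_variant`, the ID formats, and the two-variant default are hypothetical, and a real platform would also handle traffic allocation, exclusions, and exposure logging.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant for one experiment.

    Hypothetical helper: including the experiment ID in the hash key means
    a user's bucket in one test is independent of their bucket in another.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 32 bits of the digest as an integer bucket.
    # With two variants the split is exactly even over the hash space;
    # other variant counts carry a negligible modulo bias.
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


# The same inputs always give the same answer, on any machine.
first = assign_variant("user-123", "checkout-form-v2")
second = assign_variant("user-123", "checkout-form-v2")
print(first == second)  # True
```

Because nothing is stored, this kind of assignment stays stable across sessions and servers, which is one way to avoid a user seeing both variants.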

The recurring rhythms that produce tests:

  • Hypothesis backlog. A live document of testable ideas, ranked by expected value.
  • Weekly or fortnightly test launches. Predictable cadence beats sporadic bursts.
  • Standing analysis review. A regular meeting where finished tests get interpreted and ship / no-ship decisions get made together.
  • Win and loss documentation. Both shipped winners and failed tests written up so the learning compounds across the team.
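
To make "ranked by expected value" concrete, here is one possible scoring sketch. It assumes each hypothesis carries a rough win probability, an estimated value if it wins, and a cost to run; the field names, the formula, and every number in the example are illustrative assumptions, not a prescribed method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    name: str
    p_win: float         # team's rough estimate that the test wins (0 to 1)
    value_if_win: float  # estimated value of shipping the winner
    cost: float          # estimated cost to build and run the test

    @property
    def expected_value(self) -> float:
        # Assumed scoring rule: chance of winning times payoff, minus cost.
        return self.p_win * self.value_if_win - self.cost


def rank_backlog(backlog: List[Hypothesis]) -> List[Hypothesis]:
    """Return the backlog sorted from highest to lowest expected value."""
    return sorted(backlog, key=lambda h: h.expected_value, reverse=True)


# Made-up entries purely for illustration.
backlog = [
    Hypothesis("Shorter checkout form", p_win=0.5, value_if_win=1000.0, cost=100.0),
    Hypothesis("New pricing page", p_win=0.25, value_if_win=4000.0, cost=500.0),
    Hypothesis("Button colour", p_win=0.1, value_if_win=1000.0, cost=50.0),
]

for h in rank_backlog(backlog):
    print(f"{h.name}: {h.expected_value:.0f}")
```

The estimates will be rough, but writing them down forces the team to compare ideas on the same scale instead of by who argues loudest.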

The hardest part is the cultural shift:

  • Failure is data, not embarrassment. Most tests fail. A programme where failed tests get hidden or rationalised will run fewer tests and learn less.
  • Stakeholders accept that the test decides. No post-hoc “I think we should ship it anyway”. If that conversation happens, the programme isn’t a programme yet.
  • The HiPPO (highest-paid person’s opinion) problem is kept in check. Senior stakeholders don’t override results on a hunch.
  • Hypotheses come from everywhere. Customer support, the paid media team, product, and design all contribute, not just CRO specialists. The best hypotheses often come from people who see customer behaviour daily.

Programmes tend to mature through a loose progression of stages:

  1. One-offs. Occasional A/B tests when someone has a strong opinion to settle. No real programme.
  2. Habitual. Tests run regularly but each is bespoke. Inconsistent rigour, slow analysis.
  3. Templated. Standard pre-registration, defined metric set, repeatable analysis. Velocity climbs.
  4. Compounding. Multiple parallel tests, holdouts for long-run measurement, shared learning across teams. The programme is a source of ongoing competitive advantage.

Most programmes plateau at stage 2 or 3. Getting to stage 4 requires sustained investment in infrastructure and culture, often from leadership.
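
The "standard pre-registration" of stage 3 can be as simple as a fixed record that every test must fill in before launch. The sketch below is one hypothetical shape for such a template: the class name, the fields (including the minimum detectable effect and sample size), and the completeness checks are all assumptions for illustration, and a team would adapt them to its own metric set.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PreRegistration:
    """Hypothetical pre-registration record, frozen so it cannot be edited after launch."""
    hypothesis: str
    primary_metric: str
    guardrail_metrics: Tuple[str, ...]
    min_detectable_effect: float  # relative lift, e.g. 0.05 for 5%
    sample_size_per_arm: int
    decision_rule: str            # what result leads to ship vs. no-ship

    def problems(self) -> List[str]:
        """List the reasons this record is not ready to launch (empty if none)."""
        issues = []
        if not self.hypothesis.strip():
            issues.append("hypothesis is empty")
        if not self.primary_metric.strip():
            issues.append("primary metric is missing")
        if self.min_detectable_effect <= 0:
            issues.append("minimum detectable effect must be positive")
        if self.sample_size_per_arm <= 0:
            issues.append("sample size per arm must be positive")
        if not self.decision_rule.strip():
            issues.append("decision rule is missing")
        return issues


# An incomplete draft with made-up values.
draft = PreRegistration(
    hypothesis="Removing the phone field raises checkout completion",
    primary_metric="checkout_completion_rate",
    guardrail_metrics=("support_tickets_per_order",),
    min_detectable_effect=0.05,
    sample_size_per_arm=0,   # not yet calculated
    decision_rule="",        # not yet agreed
)
print(draft.problems())
# ['sample size per arm must be positive', 'decision rule is missing']
```

Writing the decision rule down before launch is what makes "the test decides" enforceable later.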

Common ways programmes go wrong:

  • Starting with infrastructure and never building the cultural muscle. The fanciest platform doesn’t help if nobody trusts the results.
  • Starting with culture and never investing in infrastructure. The team is willing but every test takes a month.
  • Buying a platform and assuming the programme follows. Platforms enable, they don’t generate.
  • Measuring programme health by test count alone. Quality of hypotheses matters too. Fifty button-colour tests teach less than ten well-chosen ones.