Building an experimentation programme

A programme is what you build when you want experimentation to outlast the people who set it up. A few one-off A/B tests aren’t a programme. A pattern of recurring tests, shared learnings, and infrastructure that supports both is.

The components, roughly in order of importance:

Infrastructure

Without the right tooling, every test costs an order of magnitude more than it should. The minimum:

A testing platform that handles randomisation, assignment consistency, and basic statistical analysis correctly. Many “platforms” don’t.
Analytics that can join experiment data with downstream metrics (lifetime value, retention, support tickets).
A way to surface running tests to the rest of the org so nobody accidentally launches conflicting tests on the same surface.
Documented pre-registration and analysis templates.
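
As a rough illustration of the first and third items, here is a minimal Python sketch. It does not describe any particular platform. It assumes assignment is done by hashing a user ID salted with the experiment name, which is one common way to get consistent assignment, and that running tests live in a simple in-memory registry. The names assign_variant, RunningTest and find_conflicts are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant for one experiment."""
    # Salting with the experiment name means the same user can land in
    # different variants across different experiments.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer bucket.
    bucket = int.from_bytes(digest[:8], "big")
    return variants[bucket % len(variants)]


@dataclass(frozen=True)
class RunningTest:
    """One entry in a hypothetical registry of running tests."""
    name: str
    surface: str  # e.g. "checkout" or "homepage" (illustrative values)
    start: date
    end: date


def find_conflicts(candidate: RunningTest, running: list) -> list:
    """Return registered tests on the same surface with overlapping dates."""
    return [
        t for t in running
        if t.surface == candidate.surface
        and t.start <= candidate.end
        and candidate.start <= t.end
    ]
```

Because the hash depends only on the user ID and the experiment name, a returning user sees the same variant for the life of a test without any stored assignment table. The conflict check is deliberately crude: it flags any date overlap on the same surface and leaves the judgement call to a person.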

Process

The recurring rhythms that produce tests:

Hypothesis backlog. A live document of testable ideas, ranked by expected value.
Weekly or fortnightly test launches. Predictable cadence beats sporadic bursts.
Standing analysis review. A regular meeting where finished tests get interpreted and ship / no-ship decisions get made together.
Win and loss documentation. Both shipped winners and failed tests written up so the learning compounds across the team.
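
"Ranked by expected value" can be made concrete with a small scoring rule. The sketch below is one simple way to do it, under stated assumptions: each idea carries a team-supplied estimate of its chance of winning (p_win), the value if it wins (value_if_win) and the cost of running it (cost), and expected value is taken as p_win * value_if_win - cost. The field names and the formula are hypothetical choices, not a standard.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One backlog entry; all numeric fields are rough team estimates."""
    title: str
    p_win: float         # estimated probability the variant wins (0 to 1)
    value_if_win: float  # estimated value if it wins, in any consistent unit
    cost: float          # estimated cost to build and run, in the same unit

    @property
    def expected_value(self) -> float:
        # Assumed definition: probability-weighted upside minus cost.
        return self.p_win * self.value_if_win - self.cost


def rank_backlog(backlog: list) -> list:
    """Return the backlog sorted from highest to lowest expected value."""
    return sorted(backlog, key=lambda h: h.expected_value, reverse=True)


# Illustrative entries only; the numbers are made up for the example.
backlog = [
    Hypothesis("Button colour", p_win=0.5, value_if_win=5_000, cost=500),
    Hypothesis("Simplified checkout", p_win=0.2, value_if_win=50_000, cost=2_000),
    Hypothesis("New pricing page", p_win=0.1, value_if_win=120_000, cost=8_000),
]

for h in rank_backlog(backlog):
    print(f"{h.title}: {h.expected_value:,.0f}")
```

The estimates will be wrong in detail. The point of scoring is to force an explicit comparison, so that a cheap, likely win does not automatically outrank a riskier idea with a much larger payoff.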

Culture

The hardest part. The cultural shifts:

Failure is data, not embarrassment. Most tests fail to beat the control. A programme where failed tests get hidden or rationalised will run fewer tests and learn less.
Stakeholders accept that the test decides. No post-hoc “I think we should ship it anyway”. If that conversation happens, the programme isn’t a programme yet.
The HiPPO (highest paid person’s opinion) problem is kept in check. Senior stakeholders don’t override results because they have a hunch.
Hypotheses come from everywhere. Customer support, the paid media team, product and design all contribute, not just CRO specialists. The best hypotheses often come from people who see customer behaviour daily.

Common stages of programme maturity

A loose progression:

1. One-offs. Occasional A/B tests when someone has a strong opinion to settle. No real programme.
2. Habitual. Tests run regularly but each is bespoke. Inconsistent rigour, slow analysis.
3. Templated. Standard pre-registration, defined metric set, repeatable analysis. Velocity climbs.
4. Compounding. Multiple parallel tests, holdouts for long-run measurement, shared learning across teams. The programme is a source of ongoing competitive advantage.

Most programmes plateau at stage 2 or 3. Getting to stage 4 requires sustained investment in infrastructure and culture, often from leadership.
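
The "repeatable analysis" of the templated stage can be as small as one shared function that every finished test runs through. The sketch below assumes a binary conversion metric and uses a two-sided two-proportion z-test with a pooled standard error. That is one common choice, picked here for illustration; the function name and argument names are hypothetical.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (absolute lift, z statistic, two-sided p-value) for B versus A."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups need at least one user")
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (rate_b - rate_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_b - rate_a, z, p_value


# Illustrative numbers only: 100/1000 conversions in control, 130/1000 in treatment.
lift, z, p = two_proportion_z_test(100, 1000, 130, 1000)
print(f"lift={lift:.3f} z={z:.2f} p={p:.3f}")
```

Having every test go through the same function, with the metric and threshold fixed at pre-registration, is what removes the bespoke analysis work that slows the habitual stage.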

Things people get wrong

Starting with infrastructure and never building the cultural muscle. The fanciest platform doesn’t help if nobody trusts the results.
Starting with culture and never investing in infrastructure. The team is willing but every test takes a month.
Buying a platform and assuming the programme follows. Platforms enable tests; they don’t generate them.
Measuring programme health by test count alone. Quality of hypotheses matters too. Fifty button-colour tests teach less than ten well-chosen ones.