Guardrail metrics

Guardrail metrics are the things you watch to make sure your variant isn’t quietly hurting something important. They’re not what you call the test on; they’re the alarms that go off when the primary-metric lift is fake or comes at a hidden cost.

The structure: define both before the test starts. The primary metric is what you optimise. Guardrails are what you protect. The variant ships only if the primary metric crosses its threshold AND no guardrail breaches its limit.
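The ship rule above can be written down as a few lines of code. This is a minimal sketch, not a prescribed implementation: the `Guardrail` class, the `should_ship` function and every number in the example are hypothetical, and it assumes each guardrail change is signed so that a positive value means "worse".

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    """One pre-committed guardrail (hypothetical structure for illustration)."""
    name: str
    observed_change: float  # variant vs control, signed so positive = worse
    limit: float            # largest degradation agreed before the test


def should_ship(primary_lift, primary_threshold, guardrails):
    """Return (ship, breached): ship only if the primary metric reaches its
    threshold AND no guardrail exceeds its limit."""
    breached = [g.name for g in guardrails if g.observed_change > g.limit]
    ship = primary_lift >= primary_threshold and not breached
    return ship, breached


# Illustrative numbers only: a 4% lift against a 2% threshold,
# but the error-rate guardrail is over its limit.
guardrails = [
    Guardrail("bounce_rate", observed_change=0.01, limit=0.03),
    Guardrail("error_rate", observed_change=0.05, limit=0.02),
]
ship, breached = should_ship(0.04, 0.02, guardrails)
print(ship, breached)
```

The point of returning the breached names rather than a bare boolean is that the conversation after a blocked ship is always "which guardrail, and by how much".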

Common guardrail metrics

The standard set in eCommerce and SaaS:

Bounce rate - if conversion lift came with a bounce-rate spike, the variant probably hurt traffic quality.
Time on task or completion time - catches tests that “win” conversion by rushing people through a flow they would have completed anyway, just with more deliberation.
Latency / page speed - variants often add JavaScript or DOM nodes that slow the page. The guardrail catches the performance regression.
Error rate - JS errors, server errors, broken interactions. Self-explanatory but often skipped.
Customer support contact rate - the most underused guardrail. If the variant drives more support tickets, the experience is worse even if conversion went up.
Cancellation or refund rate - especially for subscription products. A win at checkout that triples cancellations is a loss.
Cross-funnel impact - watches for a checkout win that tanks the rate at which PDP (product detail page) traffic reaches checkout. Tests can pull conversion from earlier in the funnel rather than create it.

Pre-defined thresholds

Guardrails need pre-committed limits, the same way the primary metric needs a pre-committed threshold. Otherwise the guardrail becomes “yeah but the lift is so good, we can absorb the bounce-rate hit”, which is metric-shopping in reverse.

A practical pattern: define for each guardrail what counts as “no significant degradation”. Usually that means a one-tailed comparison with a more permissive alpha than you’d use for the primary metric - you’re checking for damage in one direction only, and you’d rather flag a false alarm than miss a real regression.
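One way to run that one-tailed check for a rate-style guardrail (bounce rate, error rate, refund rate) is a one-sided two-proportion z-test. The sketch below is an illustration under assumptions, not the only valid choice: the function name `guardrail_degraded`, the default `alpha=0.10` and the counts in the example are all made up, and it assumes the guardrail counts a "bad" event so that a higher rate in the variant means damage.

```python
from math import sqrt
from statistics import NormalDist


def guardrail_degraded(bad_control, n_control, bad_variant, n_variant, alpha=0.10):
    """One-sided two-proportion z-test.

    Tests whether the variant's rate of a 'bad' event is higher than the
    control's. alpha=0.10 is an assumed, deliberately permissive default.
    Returns (flagged, p_value).
    """
    p_control = bad_control / n_control
    p_variant = bad_variant / n_variant
    # Pooled rate under the null hypothesis of no difference.
    pooled = (bad_control + bad_variant) / (n_control + n_variant)
    se = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variant))
    if se == 0:
        return False, 1.0  # no variation at all, nothing to flag
    z = (p_variant - p_control) / se
    # Upper-tail probability only: we care about damage, not improvement.
    p_value = 1 - NormalDist().cdf(z)
    return p_value < alpha, p_value


# Illustrative counts: 4,000 bounces of 10,000 control sessions
# vs 4,150 bounces of 10,000 variant sessions.
flagged, p_value = guardrail_degraded(4000, 10000, 4150, 10000)
print(flagged, round(p_value, 4))
```

Because the test is one-sided, a variant that improves the guardrail can never trip it, and raising alpha makes the alarm more sensitive at the cost of more false alarms, which is the trade the paragraph above argues for.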

What guardrails won’t catch

Long-run effects that take longer than the test to surface. That’s what holdouts are for.
Interactions with other tests running on adjacent surfaces. Multiple tests sharing traffic can pollute each other’s metrics in ways no single guardrail will flag.
Reputational or qualitative downsides. A guardrail can flag a metric regression, not a brand voice mismatch.