Sensitivity analysis

Sensitivity analysis is taking the test result and re-checking it under alternative analytical choices. Different outlier-handling rule. Different definition of the metric. Different inclusion criteria. Different statistical model. If the conclusion holds across reasonable variations, the result is robust. If small choices flip the conclusion, the result is fragile.

It’s a sanity check, not a primary analysis. The point isn’t to find the “right” answer among the variations - it’s to know whether the answer you committed to in the pre-registration is sturdy or precarious.

What to vary

The choices most worth re-testing:

Outlier treatment. Did the win depend on a few extreme orders or sessions? Trim them, winsorise them, or remove them entirely. If the result vanishes, the headline was the outliers, not the intervention.
Metric definition. Did the variant lift “revenue per session” but not “revenue per user”? Different denominator, different result. The choice should match the unit of randomisation.
Exclusion criteria. Some teams exclude bots, internal traffic, or short sessions. Try the analysis with and without each exclusion.
Statistical model. If you used a t-test, re-run with a permutation test or bootstrap. If you used naive variance, re-run with the delta method.
Time period. Does the result hold if you exclude the first day (novelty) or the last weekend (seasonality)?

How robust is robust enough

The honest threshold: if any reasonable analytical choice flips the conclusion, the result is fragile and shouldn’t be shipped without further validation. If the magnitude moves but the direction is stable, you have a real but imprecise result. If the conclusion holds across all variations, ship.

This isn’t free. Sensitivity analysis takes engineering time and discipline. Most CRO programmes don’t do it because the primary result already crossed alpha and the team is ready to move on.

Things people get wrong

Treating sensitivity analysis as optional. For high-stakes ship decisions it’s the difference between “we saw a winner” and “we saw a winner we’re confident in”.
Using sensitivity analysis to find the framing where the result wins. That’s data dredging in reverse. The point is to challenge the result, not to defend it.
Not pre-specifying which sensitivity checks you’ll run. Otherwise you pick the ones that confirm your preferred conclusion.
Skipping it on “obvious” results. The obvious ones can still be artefacts of a quirky outlier or definitional choice.