AA testing

An AA test is a control-vs-control experiment. Both groups get the same experience, no variant is being tested. It uses the same machinery as an A/B test, but the point is to validate that the testing infrastructure produces honest results when there’s no real effect to detect.

If everything’s working, an AA test at alpha = 0.05 should be “significant” about 5% of the time by chance, with both groups showing roughly equal performance on average. If your AA tests come back significant more often than that, or one side systematically wins, something is broken upstream of the analysis.

What AA tests catch

Bad randomisation. If assignment is correlated with anything that predicts the outcome (device, time of day, traffic source), AA tests will show false positives more often than 5%.
Sample ratio mismatch. Even an AA test can produce SRM if the assignment mechanism is broken. The variant side may be losing users (slower JS, redirect issues, etc.).
Instrumentation drift. If event tracking fires differently between the two cohorts, the metrics diverge even though the experience is identical.
Analysis pipeline bugs. If the variance estimate is wrong or the delta method isn’t applied where it should be, AA tests flag this as inflated significance rates.

How to run them

A few standard cadences:

Permanent background AA. Run a low-traffic AA continuously. Look at the false-positive rate over many test windows.
Quarterly diagnostic AA. Spin up a focused AA every quarter to verify the platform is calibrated. Useful when infrastructure has changed.
Pre-launch AA. Before running a real test on a new surface, run an AA on it first to confirm assignment and tracking work as expected.

A reasonable threshold: across a batch of AA tests, the rate of “significant” results should be close to your nominal alpha. If you’re running at α = 0.05 and 20% of your AA tests are significant, something is wrong.

Things people get wrong

Expecting AA tests to be flat. They shouldn’t be - by definition some will cross alpha by chance. The check is on the rate, not on individual outcomes.
Running too few AA tests to detect calibration issues. You need a meaningful sample of AA test outcomes (10+) before the false-positive rate is informative.
Ignoring AA failures as “just noise”. A consistently inflated false-positive rate is a real bug, even if it’s hard to pin down which component is broken.
Assuming AA tests prove anything about the test design of real experiments. They validate infrastructure. Sample size, hypothesis quality, and metric choice still apply.