Segment analysis

Most tests report a single overall effect. In practice the effect usually varies across users: mobile responds differently to desktop, new visitors to returning ones, paid traffic to organic. That variation is the heterogeneous treatment effect (HTE), and it’s often more useful than the overall number.

The two main reasons to look at segments:

  • The overall result hides a meaningful split. Variant A wins overall by 3%, but it’s a 10% win on mobile and a 4% loss on desktop. The clean answer is “ship to mobile, keep desktop on control” - very different from “ship to everyone”.
  • The overall result is misleading. An apparent win is driven entirely by one segment that happened to be over-represented, or to behave unusually, during the test. Without segment analysis you’d ship a change that’s neutral or negative for most users.
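
The split in the first bullet can be reproduced with a few lines of arithmetic. The sketch below uses made-up visitor and conversion counts, chosen only so that the relative lifts come out at +10% on mobile, -4% on desktop and +3% overall; the names `counts`, `relative_lift` and `pooled` are illustrative, not part of any particular testing tool.

```python
# Illustrative counts only: (visitors, conversions) per segment and arm.
counts = {
    "mobile": {"control": (10_000, 500), "variant": (10_000, 550)},
    "desktop": {"control": (10_000, 500), "variant": (10_000, 480)},
}


def relative_lift(control, variant):
    """Relative change in conversion rate of the variant versus control."""
    c_visitors, c_conversions = control
    v_visitors, v_conversions = variant
    c_rate = c_conversions / c_visitors
    v_rate = v_conversions / v_visitors
    return (v_rate - c_rate) / c_rate


def pooled(arm):
    """Sum visitors and conversions for one arm across all segments."""
    visitors = sum(segment[arm][0] for segment in counts.values())
    conversions = sum(segment[arm][1] for segment in counts.values())
    return visitors, conversions


segment_lifts = {
    name: relative_lift(segment["control"], segment["variant"])
    for name, segment in counts.items()
}
overall_lift = relative_lift(pooled("control"), pooled("variant"))

for name, lift in segment_lifts.items():
    print(f"{name}: {lift:+.1%}")
print(f"overall: {overall_lift:+.1%}")
```

With equal traffic in both segments, the overall figure lands between the two segment lifts, and on its own it says nothing about the fact that they point in opposite directions.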

The trap is the multiple testing problem. Slice the data 20 ways and at α = 0.05 you’d expect about one significant segment by chance alone, even when there’s no real heterogeneity. Many “winning segment” findings in casual conversion rate optimisation (CRO) are this artefact.
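
As a rough check on that arithmetic, the sketch below computes the expected number of false positives and the chance of seeing at least one. It assumes the slices behave like independent tests with no real effect, which real segments (overlapping users, correlated behaviour) only approximate; the function names are illustrative.

```python
def expected_false_positives(num_tests, alpha=0.05):
    """Expected count of significant results when no test has a real effect."""
    return num_tests * alpha


def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance of one or more false positives, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


for num_tests in (1, 5, 20):
    expected = expected_false_positives(num_tests)
    at_least_one = prob_at_least_one_false_positive(num_tests)
    print(f"{num_tests:>2} slices: expect {expected:.2f} false positives, "
          f"P(at least one) = {at_least_one:.0%}")
```

Under those assumptions, 20 slices give roughly a 64% chance of at least one "significant" segment with nothing real behind it.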

The honest version:

  • Pre-specify segments in the pre-registration. Major splits you expect to matter (device, audience source, customer tier) listed up front. Anything else is exploratory.
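  • Apply multiple-testing corrections for any post-hoc segment analysis. Bonferroni if you want to limit the chance of any false positive at all, or a false discovery rate (FDR) procedure such as Benjamini-Hochberg if you can accept a controlled share of false positives among the findings.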
  • Treat exploratory segment findings as hypotheses, not conclusions. “Mobile users seem to respond better” is a hypothesis to test in a follow-up, not a result to ship on.
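
The two corrections named above can be sketched in plain Python. This is a minimal illustration rather than a vetted statistics library: the p-values are invented, the segment names are hypothetical, and Benjamini-Hochberg is used here as one common FDR procedure.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for false discovery rate control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value is at or below (k / m) * alpha.
    max_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[index] = True
    return rejected


# Invented p-values for five hypothetical post-hoc segments.
segment_p_values = {
    "mobile": 0.004,
    "desktop": 0.03,
    "paid": 0.02,
    "organic": 0.40,
    "new": 0.012,
}
names = list(segment_p_values)
p_values = list(segment_p_values.values())

bonferroni_flags = bonferroni(p_values)
bh_flags = benjamini_hochberg(p_values)

for name, p, bonf, bh in zip(names, p_values, bonferroni_flags, bh_flags):
    print(f"{name:<8} p={p:<6} bonferroni={bonf} benjamini_hochberg={bh}")
```

On these invented numbers, four of the five segments clear an uncorrected 0.05 threshold, Bonferroni keeps only the strongest one, and Benjamini-Hochberg keeps four. Which is appropriate depends on how costly a false "winning segment" is.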

The segments worth checking are the ones that plausibly interact with the intervention:

  • Device type, especially for any test involving layout, speed, or interaction patterns
  • Traffic source, especially for tests on cold-traffic landing pages
  • Customer tier or product category
  • New vs returning
  • Geography or language, if the audience is internationally mixed

Slicing by variables with no plausible link to the change (browser version, exact session time, day of week) usually finds noise.

When the mix of segments differs between the variants, for example because the incoming traffic or the allocation shifted during the test, the overall result and the segment results can disagree. Variant A can win in every segment and still lose overall, because more of its traffic came from a low-converting segment. This is Simpson’s paradox and it’s the strongest argument for always looking at segments alongside the aggregate.
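
A small numeric sketch makes the reversal concrete. The counts below are made up and deliberately extreme: variant A receives mostly mobile traffic (low-converting), while control receives mostly desktop traffic (high-converting). All names are illustrative.

```python
# Made-up counts: (visitors, conversions) per arm and segment.
simpson_counts = {
    "control": {"mobile": (1_000, 20), "desktop": (4_000, 400)},
    "variant_a": {"mobile": (4_000, 100), "desktop": (1_000, 110)},
}


def segment_rate(arm, segment):
    """Conversion rate for one arm within one segment."""
    visitors, conversions = simpson_counts[arm][segment]
    return conversions / visitors


def overall_rate(arm):
    """Conversion rate for one arm pooled across all segments."""
    visitors = sum(v for v, _ in simpson_counts[arm].values())
    conversions = sum(c for _, c in simpson_counts[arm].values())
    return conversions / visitors


for segment in ("mobile", "desktop"):
    control = segment_rate("control", segment)
    variant = segment_rate("variant_a", segment)
    print(f"{segment}: control {control:.1%}, variant A {variant:.1%}")

print(f"overall: control {overall_rate('control'):.1%}, "
      f"variant A {overall_rate('variant_a'):.1%}")
```

Variant A converts better on mobile (2.5% against 2.0%) and on desktop (11.0% against 10.0%), yet its pooled rate is 4.2% against 8.4% for control, purely because of where its traffic came from.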