A/B testing sample size and significance
The math behind a real result — and the shortcuts that quietly produce false wins.
A/B testing looks like a coin flip with a results page. It isn't. Behind the green up-arrow on every dashboard is a pile of statistics that, if skipped, quietly fills your roadmap with false wins. The math isn't hard. The discipline to wait for it is.
Why the math matters
Run a test on too little traffic and the result is noise. The variant looks 12% better, you ship it, and three months later the number has drifted back. The test wasn't measuring a real lift; it was measuring the random variation that exists in any sample. Statistics is the toolkit that tells you when the difference between A and B is real and when it's the dice talking.
Two numbers do most of the work: sample size and statistical significance. Sample size tells you how much traffic you need before the test can detect the lift you care about. Significance tells you how unlikely your observed difference would be if there were no real effect, which is what lets you separate signal from noise.
Sample size: calculate before you launch
Required sample size depends on four inputs: your baseline conversion rate, the minimum detectable effect (MDE), your significance level, and your statistical power.
- Baseline conversion rate — your current rate. Lower baselines need more traffic per variant to detect the same relative lift.
- Minimum detectable effect — the smallest lift you'd actually ship. A test designed to detect a 1% lift needs roughly 25 times the sample of one designed to detect a 5% lift. Be honest about what's worth shipping.
- Significance level — usually 95% (alpha = 0.05). Higher is stricter and needs more sample.
- Statistical power — usually 80%. Power is the probability of detecting a real effect when one exists.
Plug those into a sample-size calculator before the test starts and you get the per-variant traffic requirement. If the answer is "you need eight weeks of traffic to detect a 2% lift," that's not bad luck — that's the test telling you the experiment isn't viable, and you should either pick a higher-impact change or accept you can only detect bigger lifts.
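As a concrete sketch of how those inputs interact, here's a rough per-variant calculation in Python using the standard normal approximation for a two-proportion test. The 4% baseline and the two MDE values are made-up numbers for illustration, and a real calculator may use a slightly different (often exact) formula.

```python
import math
from scipy.stats import norm

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect a relative lift."""
    p1 = baseline                          # control conversion rate
    p2 = baseline * (1 + relative_mde)     # variant rate if the lift is real
    z_alpha = norm.ppf(1 - alpha / 2)      # two-sided 95% significance -> 1.96
    z_power = norm.ppf(power)              # 80% power -> 0.84
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

# Illustrative only: a 4% baseline, comparing a 10% vs. a 2% relative MDE.
# Required sample scales roughly with 1/MDE^2: a 5x smaller MDE costs ~25x the traffic.
print(sample_size_per_variant(0.04, 0.10))   # tens of thousands per variant
print(sample_size_per_variant(0.04, 0.02))   # roughly 25x that
```

Running it with your own baseline and MDE gives the per-variant requirement described above; multiply by the number of variants to get the total traffic the test needs.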
Statistical significance: what 95% actually means
A p-value of 0.05 (95% significance) means: if there were no real difference between variants, you'd see a result this extreme by chance about 5% of the time. It does not mean "there's a 95% chance B is better than A." That subtle difference matters once you start running tests in volume.
Run twenty tests with no real effects, all at 95% significance, and one of them is expected to falsely reach significance. This is why teams that run lots of tests need a stricter bar — and why the discipline of pre-registering hypotheses and ignoring tests that "almost won" matters. The dashboard's green arrow is a probabilistic statement, not a verdict.
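The arithmetic behind that twenty-test claim is short enough to check directly. The snippet below just restates the scenario above (twenty null tests at alpha = 0.05); the Bonferroni line at the end is one common, if blunt, way to tighten the per-test bar when you run many tests.

```python
alpha = 0.05     # 95% significance threshold
n_tests = 20     # independent tests with no real underlying effect

expected_false_wins = n_tests * alpha            # 1.0: one false win expected on average
p_at_least_one = 1 - (1 - alpha) ** n_tests      # ~0.64: more likely than not

print(f"Expected false wins across {n_tests} null tests: {expected_false_wins:.1f}")
print(f"Chance of at least one false win: {p_at_least_one:.0%}")

# One blunt correction (Bonferroni): divide alpha by the number of comparisons.
print(f"Per-test alpha after a Bonferroni correction: {alpha / n_tests}")
```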
Pitfalls that quietly fake wins
Most false wins come from a small list of repeat offenders. The math is fine; the human running the test is the failure mode.
- Peeking and stopping early. Calling the test the moment significance is hit produces dramatically inflated false-positive rates (the simulation after this list shows how large the inflation gets). Pre-commit to a sample size and run to it. Sequential testing methods exist if you genuinely need to monitor — they require a different statistical framework, not just stronger willpower.
- Running too short to absorb weekday effects. User behavior on Tuesday differs from Sunday. A three-day test can win on weekday traffic and lose on weekends. Run at least one full week, ideally two.
- Novelty effects. A redesigned page often spikes engagement in the first 48 hours from existing users reacting to the change, then settles to a real baseline. Discount the first few days of any visible UI test.
- Sample ratio mismatch. If your 50/50 split is showing 47/53, traffic isn't being split as expected — usually a tracking bug — and the test is broken. Check the split before trusting the result.
- Multiple comparisons. Running ten variants against one control and shipping the highest-performing one is a near-guaranteed false win without a corrected significance level.
- Segment-snooping after the fact. "It didn't win overall, but it won for mobile users in California" is the sound of a false positive. Pre-specify segments you'll analyze.
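To see how badly peeking distorts results, here's a minimal A/A simulation sketch: both arms share the same true conversion rate, so every "significant" result is by construction a false win. The 5% conversion rate, batch size, and peek count are assumed values; the point is the gap between the two reported rates, not the exact numbers.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

TRUE_RATE = 0.05      # identical in both arms: there is nothing real to detect
BATCH = 1_000         # visitors added to each arm between peeks
N_PEEKS = 20
N_SIMULATIONS = 2_000
ALPHA = 0.05

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided, pooled two-proportion z-test p-value."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

peeking_wins = 0      # tests "called" at the first significant peek
single_look_wins = 0  # tests evaluated once, at the pre-committed sample size

for _ in range(N_SIMULATIONS):
    conv_a = conv_b = 0
    stopped_early = False
    for peek in range(1, N_PEEKS + 1):
        conv_a += rng.binomial(BATCH, TRUE_RATE)
        conv_b += rng.binomial(BATCH, TRUE_RATE)
        n = peek * BATCH
        if not stopped_early and p_value(conv_a, n, conv_b, n) < ALPHA:
            peeking_wins += 1
            stopped_early = True
    if p_value(conv_a, N_PEEKS * BATCH, conv_b, N_PEEKS * BATCH) < ALPHA:
        single_look_wins += 1

print(f"False-positive rate with peeking:  {peeking_wins / N_SIMULATIONS:.1%}")
print(f"False-positive rate with one look: {single_look_wins / N_SIMULATIONS:.1%}")
```

The single-look rate lands near the nominal 5%; the stop-at-first-significant-peek rate comes out several times higher, which is exactly the inflation the first bullet warns about.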
Design tests that can win
Tests fail silently when the change isn't bold enough to move the metric. Subtle copy tweaks rarely produce detectable lifts on small traffic. The teams getting consistent wins from landing-page A/B testing are testing meaningful changes — different value props, restructured pages, alternative hero treatments — and reserving the small-tweak tests for high-traffic pages where they can actually be measured.
This is also why test prioritization matters more than test volume. A well-designed CRO program calculates the required sample size per test and only ships the ones that fit the available traffic budget. Half the value of statistical thinking is killing tests before they run.
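One way to make that budget concrete, with assumed numbers: convert the per-variant requirement (for instance, the output of the calculator sketched earlier) into runtime on the page you actually have.

```python
# Assumed figures for illustration; they are not from the article.
weekly_visitors = 20_000    # traffic reaching the page under test
n_per_variant = 40_000      # e.g. from the sample-size sketch above
variants = 2                # control + one variant

weeks_to_run = variants * n_per_variant / weekly_visitors
print(f"Weeks needed at full traffic: {weeks_to_run:.1f}")   # 4.0 here
```

If the answer is twelve weeks, the test doesn't fit the budget: pick a bolder change, a higher-traffic page, or a larger MDE.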
Tracking that doesn't lie
Even perfect statistics fail if the underlying tracking is broken. Two checks save most teams from publishing fictional results:
- Verify event firing in production for both control and variant before launching. The conversion event missing on one variant is the classic tracking bug.
- Tag traffic sources cleanly with a consistent UTM taxonomy so you can segment results without guessing what "facebook" vs "Facebook" means.
And once a winner ships, the lift should show up in your business metrics. If your test claimed a 15% conversion lift and your actual conversion rate didn't budge, the test was wrong — and that audit is how you avoid feeding fiction into your marketing ROI calculations a quarter later.