A/B testing sample size and significance
The math behind a real result — and the shortcuts that quietly produce false wins.
A/B testing looks like a coin flip with a results page. It isn't. Behind the green up-arrow on every dashboard is a pile of statistics that, if skipped, quietly fills your roadmap with false wins. The math isn't hard. The discipline to wait for it is.
Why the math matters
Run a test on too little traffic and the result is noise. The variant looks 12% better, you ship it, and three months later the number has drifted back. The test wasn't measuring a real lift; it was measuring the random variation that exists in any sample. Statistics is the toolkit that tells you when the difference between A and B is real and when it's the dice talking.
Two numbers do most of the work: sample size and statistical significance. Sample size tells you how much traffic you need before the test can detect the lift you care about. Significance tells you how unlikely your observed difference would be if there were no real effect, which is what lets you separate signal from noise.
Sample size: calculate before you launch
Required sample size depends on four inputs: your baseline conversion rate, the minimum detectable effect (MDE), your significance level, and your statistical power.
- Baseline conversion rate — your current rate. Lower baselines need more traffic per variant to detect the same relative lift.
- Minimum detectable effect — the smallest lift you'd actually ship. A test designed to detect a 1% lift needs roughly 25 times the sample of one designed to detect a 5% lift. Be honest about what's worth shipping.
- Significance level — usually 95% (alpha = 0.05). Higher is stricter and needs more sample.
- Statistical power — usually 80%. Power is the probability of detecting a real effect when one exists.
Plug those into a sample-size calculator before the test starts and you get the per-variant traffic requirement. If the answer is "you need eight weeks of traffic to detect a 2% lift," that's not bad luck — that's the test telling you the experiment isn't viable, and you should either pick a higher-impact change or accept you can only detect bigger lifts.
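As a concrete sketch of how those inputs interact, here's a rough per-variant calculation in Python using the standard normal approximation for a two-proportion test. The 4% baseline and the two MDE values are made-up numbers for illustration, and a real calculator may use a slightly different (often exact) formula.

```python
import math
from scipy.stats import norm

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect a relative lift."""
    p1 = baseline                          # control conversion rate
    p2 = baseline * (1 + relative_mde)     # variant rate if the lift is real
    z_alpha = norm.ppf(1 - alpha / 2)      # two-sided 95% significance -> 1.96
    z_power = norm.ppf(power)              # 80% power -> 0.84
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

# Illustrative only: a 4% baseline, comparing a 10% vs. a 2% relative MDE.
# Required sample scales roughly with 1/MDE^2: a 5x smaller MDE costs ~25x the traffic.
print(sample_size_per_variant(0.04, 0.10))   # tens of thousands per variant
print(sample_size_per_variant(0.04, 0.02))   # roughly 25x that
```

Running it with your own baseline and MDE gives the per-variant requirement described above; multiply by the number of variants to get the total traffic the test needs.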
Statistical significance: what 95% actually means
A p-value of 0.05 (95% significance) means: if there were no real difference between variants, you'd see a result this extreme by chance about 5% of the time. It does not mean "there's a 95% chance B is better than A." That subtle difference matters once you start running tests in volume.
Run twenty tests with no real effects, all at 95% significance, and one of them is expected to falsely reach significance. This is why teams that run lots of tests need a stricter bar — and why the discipline of pre-registering hypotheses and ignoring tests that "almost won" matters. The dashboard's green arrow is a probabilistic statement, not a verdict.
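The arithmetic behind that twenty-test claim is short enough to check directly. The snippet below just restates the scenario above (twenty null tests at alpha = 0.05); the Bonferroni line at the end is one common, if blunt, way to tighten the per-test bar when you run many tests.

```python
alpha = 0.05     # 95% significance threshold
n_tests = 20     # independent tests with no real underlying effect

expected_false_wins = n_tests * alpha            # 1.0: one false win expected on average
p_at_least_one = 1 - (1 - alpha) ** n_tests      # ~0.64: more likely than not

print(f"Expected false wins across {n_tests} null tests: {expected_false_wins:.1f}")
print(f"Chance of at least one false win: {p_at_least_one:.0%}")

# One blunt correction (Bonferroni): divide alpha by the number of comparisons.
print(f"Per-test alpha after a Bonferroni correction: {alpha / n_tests}")
```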
Pitfalls that quietly fake wins
Most false wins come from a small list of repeat offenders. The math is fine; the human running the test is the failure mode.
- Peeking and stopping early. Calling the test the moment significance is hit produces dramatically inflated false-positive rates (the simulation after this list shows how large the inflation gets). Pre-commit to a sample size and run to it. Sequential testing methods exist if you genuinely need to monitor — they require a different statistical framework, not just stronger willpower.
- Running too short to absorb weekday effects. User behavior on Tuesday differs from Sunday. A three-day test can win on weekday traffic and lose on weekends. Run at least one full week, ideally two.
- Novelty effects. A redesigned page often spikes engagement in the first 48 hours from existing users reacting to the change, then settles to a real baseline. Discount the first few days of any visible UI test.
- Sample ratio mismatch. If your 50/50 split is showing 47/53, traffic isn't being split as expected — usually a tracking bug — and the test is broken. Check the split before trusting the result.
- Multiple comparisons. Running ten variants against one control and shipping the highest-performing one is a near-guaranteed false win without a corrected significance level.
- Segment-snooping after the fact. "It didn't win overall, but it won for mobile users in California" is the sound of a false positive. Pre-specify segments you'll analyze.
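To see how badly peeking distorts results, here's a minimal A/A simulation sketch: both arms share the same true conversion rate, so every "significant" result is by construction a false win. The 5% conversion rate, batch size, and peek count are assumed values; the point is the gap between the two reported rates, not the exact numbers.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

TRUE_RATE = 0.05      # identical in both arms: there is nothing real to detect
BATCH = 1_000         # visitors added to each arm between peeks
N_PEEKS = 20
N_SIMULATIONS = 2_000
ALPHA = 0.05

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided, pooled two-proportion z-test p-value."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

peeking_wins = 0      # tests "called" at the first significant peek
single_look_wins = 0  # tests evaluated once, at the pre-committed sample size

for _ in range(N_SIMULATIONS):
    conv_a = conv_b = 0
    stopped_early = False
    for peek in range(1, N_PEEKS + 1):
        conv_a += rng.binomial(BATCH, TRUE_RATE)
        conv_b += rng.binomial(BATCH, TRUE_RATE)
        n = peek * BATCH
        if not stopped_early and p_value(conv_a, n, conv_b, n) < ALPHA:
            peeking_wins += 1
            stopped_early = True
    if p_value(conv_a, N_PEEKS * BATCH, conv_b, N_PEEKS * BATCH) < ALPHA:
        single_look_wins += 1

print(f"False-positive rate with peeking:  {peeking_wins / N_SIMULATIONS:.1%}")
print(f"False-positive rate with one look: {single_look_wins / N_SIMULATIONS:.1%}")
```

The single-look rate lands near the nominal 5%; the stop-at-first-significant-peek rate comes out several times higher, which is exactly the inflation the first bullet warns about.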
Design tests that can win
Tests fail silently when the change isn't bold enough to move the metric. Subtle copy tweaks rarely produce detectable lifts on small traffic. The teams getting consistent wins from landing-page A/B testing are testing meaningful changes — different value props, restructured pages, alternative hero treatments — and reserving the small-tweak tests for high-traffic pages where they can actually be measured.
This is also why test prioritization matters more than test volume. A well-designed CRO program calculates the required sample size per test and only ships the ones that fit the available traffic budget. Half the value of statistical thinking is killing tests before they run.
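One way to make that budget concrete, with assumed numbers: convert the per-variant requirement (for instance, the output of the calculator sketched earlier) into runtime on the page you actually have.

```python
# Assumed figures for illustration; they are not from the article.
weekly_visitors = 20_000    # traffic reaching the page under test
n_per_variant = 40_000      # e.g. from the sample-size sketch above
variants = 2                # control + one variant

weeks_to_run = variants * n_per_variant / weekly_visitors
print(f"Weeks needed at full traffic: {weeks_to_run:.1f}")   # 4.0 here
```

If the answer is twelve weeks, the test doesn't fit the budget: pick a bolder change, a higher-traffic page, or a larger MDE.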
Tracking that doesn't lie
Even perfect statistics fail if the underlying tracking is broken. Two checks save most teams from publishing fictional results:
- Verify event firing in production for both control and variant before launching. The conversion event missing on one variant is the classic tracking bug.
- Tag traffic sources cleanly with a consistent UTM taxonomy so you can segment results without guessing what "facebook" vs "Facebook" means.
And once a winner ships, the lift should show up in your business metrics. If your test claimed a 15% conversion lift and your actual conversion rate didn't budge, the test was wrong — and that audit is how you avoid feeding fiction into your marketing ROI calculations a quarter later.