A/B Testing Done Right: Experiment Design, Sample Size, and Avoiding False Discoveries
A practical guide to A/B testing covering how to design valid experiments, calculate the sample size you actually need, choose the right statistical test, interpret results without fooling yourself, and avoid the most common mistakes that produce false discoveries in industry A/B tests.
What You'll Learn
- Design a valid A/B test with proper randomization, control, and a single clearly defined metric
- Calculate the minimum sample size needed to detect a meaningful effect with adequate power
- Interpret A/B test results correctly, including when a "significant" result might be a false positive
- Identify and avoid common A/B testing mistakes: peeking, multiple comparisons, and survivorship bias
1. What A/B Testing Actually Is (and Is Not)
A/B testing is a randomized controlled experiment applied to real-world decisions, typically in product, marketing, or UX. You split your audience randomly into two groups, show one group the current version (control, or A) and the other group a modified version (treatment, or B), measure a predefined outcome metric, and use a statistical test to determine whether the difference in outcomes is larger than what random chance would produce.
That sounds straightforward. In practice, most A/B tests in industry are done badly. A 2019 analysis by Kohavi, Tang, and Xu (the team behind Microsoft's experimentation platform) estimated that a significant fraction of A/B tests at major tech companies produce misleading results due to design flaws, premature stopping, or incorrect interpretation. The problem is not the statistics; the math is well understood. The problem is that people skip the design phase, peek at results mid-experiment, declare victory too early, and mistake statistical significance for practical importance.
Here is what A/B testing is NOT: it is not running two versions for a day and seeing which one "feels" better. It is not looking at the numbers after 100 visitors and declaring a winner. It is not testing 15 variations simultaneously and picking the one with the best conversion rate. Each of these approaches produces results that are indistinguishable from random noise dressed up as data-driven decisions.
A properly designed A/B test has four elements: a clearly defined hypothesis (changing the button color from blue to green will increase the click-through rate), a single primary metric (click-through rate, not click-through rate and time on page and bounce rate and revenue), a predetermined sample size (calculated before the test starts), and a predetermined analysis plan (which statistical test, what significance level, when to analyze).
Key Points
- A/B testing is a randomized controlled experiment: random assignment to control and treatment groups is non-negotiable
- Most industry A/B tests are flawed due to design issues, premature stopping, or incorrect interpretation, not bad math
- Four required elements: a defined hypothesis, a single primary metric, a predetermined sample size, and a predetermined analysis plan
- Running two versions and "seeing which feels better" is not A/B testing; it is guessing with extra steps
2. Sample Size: Why Your Test Needs More Data Than You Think
The most common A/B testing failure is running the test with too few observations. Underpowered tests miss real effects (false negatives) and, paradoxically, the "significant" results from underpowered tests are more likely to be false positives or dramatically overestimated effect sizes.
Sample size depends on four inputs: the baseline conversion rate (your current metric, e.g., a 3% click-through rate), the minimum detectable effect (MDE: the smallest improvement you care about, e.g., a 10% relative increase from 3.0% to 3.3%), the significance level (alpha, typically 0.05: the probability of a false positive you are willing to accept), and statistical power (1 - beta, typically 0.80: the probability of detecting a real effect if it exists).
For a standard two-proportion z-test with alpha = 0.05 and power = 0.80, detecting a 10% relative lift from a 3% baseline conversion rate requires approximately 53,000 observations per group, roughly 106,000 total. That surprises people. They expected a few thousand, not over a hundred thousand. But the math is unforgiving: small baseline rates and small effect sizes require enormous samples.
The formula (simplified): n per group ≈ (Z_alpha/2 + Z_beta)² × (p1(1-p1) + p2(1-p2)) / (p1 - p2)². For alpha = 0.05, Z_alpha/2 = 1.96. For power = 0.80, Z_beta = 0.84. With p1 = 0.03 and p2 = 0.033, this gives approximately 53,000 per group.
If you do not have enough traffic to reach the required sample size in a reasonable time (2-4 weeks), you have three options: increase the MDE (decide you only care about larger effects; detecting a 20% relative lift requires roughly 1/4 the sample size of detecting a 10% lift, because the squared difference in the denominator quadruples), increase alpha (accept a higher false positive rate; moving from 0.05 to 0.10 reduces sample size by about 20%), or switch to a more sensitive metric (one closer to the change, with a higher baseline rate, such as click-through on the new element rather than a downstream purchase, which can detect the same relative lift with fewer observations).
What you should never do: just run the test with whatever traffic you have and hope for the best. An underpowered test tells you nothing; a non-significant result does not mean there is no effect, and a significant result is unreliable. StatsIQ includes sample size calculators for proportions and means, plus power curve visualizations that show how the required sample size changes with different MDE and power assumptions.
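The simplified formula above can be turned into a few lines of Python. This is a sketch using only the standard library; the function name and its defaults are illustrative, not a reference to any particular tool:

```python
from statistics import NormalDist

def sample_size_per_group(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate n per group for a two-proportion z-test.

    baseline:     current conversion rate, e.g. 0.03
    relative_mde: smallest relative lift worth detecting, e.g. 0.10
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

# 3% baseline, 10% relative lift, alpha = 0.05, power = 0.80
print(round(sample_size_per_group(0.03, 0.10)))  # ~53,000 per group
```

Halving the MDE to 5% quadruples the result, which is why the choice of minimum detectable effect dominates every other input.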
Key Points
- Detecting a 10% relative lift on a 3% baseline requires ~53,000 observations per group (~106K total) at 80% power
- Four inputs: baseline rate, minimum detectable effect, significance level (alpha), and power (1 - beta)
- If you cannot reach the sample size: increase the MDE, increase alpha, or switch to a more sensitive metric. Do NOT just run underpowered.
- Underpowered tests miss real effects AND produce unreliable significant results; they are worse than no test at all
3. Running the Test: Randomization, Duration, and the Peeking Problem
Once you have your hypothesis, metric, and sample size, execution seems simple: split traffic, wait, analyze. But the execution phase has its own traps.
Randomization must be truly random and persistent. Each user should be randomly assigned to A or B once and stay there for the duration of the test. If a user sees version A on Monday and version B on Wednesday, the test is contaminated: you are no longer comparing two groups, you are comparing a mess. Use a hashing function on user ID (or cookie ID for anonymous users) that deterministically assigns each user to a group. Do not randomize by session, by time of day, or by page load.
Duration: run the test for at least one full business cycle, typically 1-2 weeks minimum, even if you reach the required sample size faster. Why? Because behavior varies by day of week (weekday vs weekend), time of day, and external events (a product launch, a holiday, a viral social media post). A test that runs Monday through Wednesday captures only weekday behavior. If your treatment effect differs between weekdays and weekends, your result is biased.
The peeking problem is the single most common source of false positives in industry A/B testing. Peeking means checking the results before the predetermined sample size is reached and stopping the test as soon as the result looks significant. Here is why this is dangerous: if you check a test after every 1,000 observations, the probability of seeing p < 0.05 at some point during the test is not 5%; it is 20-30%, depending on how often you check. You are running multiple hypothesis tests (one each time you peek) without adjusting for multiple comparisons, which inflates the false positive rate dramatically.
The fix: either commit to analyzing only at the predetermined sample size (the classical approach), or use a sequential testing framework (such as a group sequential design or always-valid p-values) that is specifically designed to allow interim looks while controlling the false positive rate.
Bayesian A/B testing also handles this more naturally because it updates a probability distribution rather than conducting a discrete hypothesis test at each look. StatsIQ includes peeking simulation exercises where you can see how checking results at different frequencies inflates the false positive rate.
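The persistent, deterministic assignment described above is typically implemented by hashing a stable user identifier together with an experiment name. A minimal sketch (the function and experiment names are invented for illustration):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically map a user to a variant. The same user_id and
    experiment name always hash to the same bucket, so assignment is
    persistent across sessions. Salting with the experiment name keeps
    assignments independent across concurrently running tests."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Same user, same experiment -> always the same variant
assert assign_variant("user-42", "checkout-flow") == assign_variant("user-42", "checkout-flow")
```

Because SHA-256 output is effectively uniform, roughly half of all users land in each bucket, giving you the random split without storing any assignment table.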
Key Points
- Randomization must be user-level and persistent: the same user must always see the same version throughout the test
- Run for at least 1-2 full weeks to capture day-of-week variation, even if the sample size is reached sooner
- Peeking (checking results before the planned sample size) inflates the false positive rate from 5% to 20-30%
- Use sequential testing or Bayesian methods if you need to monitor results before the planned endpoint
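The peeking inflation described above can be demonstrated directly by simulating A/A tests, where there is no true difference, and checking a z-test at every interim look. This sketch uses NumPy; the look schedule and simulation count are illustrative:

```python
import numpy as np

def peeking_false_positive_rate(n_sims=2000, n_per_look=1000, n_looks=10,
                                p=0.05, seed=0):
    """Simulate A/A tests (identical groups) and count how often ANY of
    the interim looks shows |z| > 1.96. With a single look at the end
    this would be ~5%; with repeated looks it is far higher."""
    rng = np.random.default_rng(seed)
    false_positives = 0
    for _ in range(n_sims):
        # cumulative conversion counts for both groups after each look
        a = np.cumsum(rng.binomial(n_per_look, p, n_looks))
        b = np.cumsum(rng.binomial(n_per_look, p, n_looks))
        n = n_per_look * np.arange(1, n_looks + 1)
        pa, pb = a / n, b / n
        pooled = (pa + pb) / 2
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if np.any(np.abs(pa - pb) / se > 1.96):
            false_positives += 1
    return false_positives / n_sims

rate = peeking_false_positive_rate()
print(f"false positive rate with 10 peeks: {rate:.1%} (nominal: 5.0%)")
```

With ten evenly spaced looks the simulated rate typically lands around 15-25%, consistent with the 20-30% figure quoted above for more frequent checking.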
4. Interpreting Results: What Significant and Not Significant Actually Mean
Your test is complete. The sample size has been reached. Now you analyze, and this is where most people get it wrong.
A statistically significant result (p < 0.05) means: if there were truly no difference between A and B, the probability of observing a difference this large or larger by random chance is less than 5%. That is it. It does not mean there is a 95% chance the treatment is better. It does not mean the effect is large or important. It does not mean the result will replicate. It means the observed data is unlikely under the null hypothesis of no difference.
A non-significant result (p >= 0.05) means: the data does not provide sufficient evidence to reject the null hypothesis. It does NOT mean there is no effect. It does not mean A and B are the same. It means you could not detect a difference with the data you collected. If your test was underpowered, a non-significant result is expected even when a real effect exists; you just did not have enough data to find it.
Practical significance vs statistical significance: a sufficiently large test (tens of millions of users per group) can detect a 0.01 percentage point difference in conversion rate (3.00% vs 3.01%) as statistically significant. But is a 0.01 percentage point improvement worth the engineering cost to implement? Almost certainly not. Always report the effect size alongside the p-value. The confidence interval is more informative than the p-value alone: it tells you both the estimated effect size and the precision of the estimate. A 95% CI of a [0.1%, 2.3%] increase tells you the true effect is probably somewhere in that range.
Multiple comparisons: if you test one metric, your false positive rate is 5%. If you test 20 metrics, the probability that at least one is significant by chance is 1 - (0.95)^20 ≈ 64%. When you report the one significant metric out of 20 without adjusting, you are cherry-picking noise.
Apply a correction: Bonferroni (divide alpha by the number of tests; conservative but simple) or Benjamini-Hochberg (controls the false discovery rate; less conservative, more appropriate for exploratory analysis). StatsIQ includes A/B test interpretation exercises that distinguish between statistical and practical significance and give practice with multiple-comparison corrections.
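Both corrections are simple enough to implement from scratch. A sketch, with p-values made up for illustration:

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 where p < alpha / m. Controls the family-wise error
    rate: the chance of even ONE false positive stays below alpha."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject H0 for the k smallest p-values, where k is the largest
    rank with p_(k) <= (k / m) * alpha. Controls the false discovery
    rate: the expected fraction of rejections that are false."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_rank = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= max_rank
    return reject

# 8 hypothetical metric p-values from one experiment
ps = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9]
print(sum(bonferroni(ps)))          # -> 1: only p = 0.001 survives alpha/8
print(sum(benjamini_hochberg(ps)))  # -> 2: BH also keeps p = 0.008
```

Note that the three p-values near 0.04, each "significant" on its own, survive neither correction: exactly the cherry-picked noise the section warns about.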
Key Points
- p < 0.05 means the data is unlikely under H0, NOT that there is a 95% probability the treatment works
- Non-significant does not mean no effect; it means insufficient evidence. The test could simply be underpowered.
- Always report the effect size and confidence interval alongside the p-value; statistical significance alone is not enough
- Testing 20 metrics gives a ~64% chance of at least one false positive. Apply a Bonferroni or Benjamini-Hochberg correction.
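Reporting the effect size with a confidence interval, as the key points recommend, takes only a few lines. This sketch uses a Wald interval for the difference in proportions, with counts invented for illustration:

```python
from math import sqrt
from statistics import NormalDist

def diff_ci(conv_a, n_a, conv_b, n_b, level=0.95):
    """Wald confidence interval for the difference in conversion rates
    (treatment minus control). Crude near 0% or 100%, fine otherwise."""
    pa, pb = conv_a / n_a, conv_b / n_b
    se = sqrt(pa * (1 - pa) / n_a + pb * (1 - pb) / n_b)
    z = NormalDist().inv_cdf((1 + level) / 2)
    diff = pb - pa
    return diff - z * se, diff + z * se

# control: 420/10,000 converted (4.2%); treatment: 480/10,000 (4.8%)
lo, hi = diff_ci(420, 10_000, 480, 10_000)
print(f"estimated lift: +0.60pp, 95% CI: [{lo:+.2%}, {hi:+.2%}]")
```

Here the interval barely excludes zero: the lift is statistically significant, but its lower end is roughly 0.03 percentage points, which is exactly why the interval is more informative than the p-value alone.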
Key Takeaways
- Sample size for a 10% relative lift on a 3% baseline at 80% power: ~53,000 per group (~106K total)
- Peeking at A/B test results inflates the false positive rate from 5% to 20-30%; use sequential methods if you must look early
- p < 0.05 means the data is unlikely under H0, not a 95% probability the treatment works. Non-significant does not mean no effect.
- Testing 20 metrics simultaneously: ~64% chance of at least one false positive without correction
- Always run for at least 1-2 full weeks; day-of-week effects can bias shorter tests
Practice Questions
1. Your website has a 2% conversion rate and gets 5,000 visitors per day. Your product manager wants to A/B test a new checkout flow and detect a 5% relative improvement (2.0% to 2.1%). How long will the test take?
2. You run an A/B test for 3 weeks. The result: control conversion = 4.2%, treatment conversion = 4.5%, p = 0.03. The product manager celebrates and wants to ship the change. You checked the results every day during the test. Is the result trustworthy?
FAQs
Common questions about this topic
How long should an A/B test run?
The minimum is until you reach the predetermined sample size, but never less than one full business cycle (typically 1-2 weeks). Day-of-week effects, paycheck cycles, and weekly behavioral patterns can bias tests that run for only a few days. If your required sample size is reached in 3 days, still run for 7-14 days and analyze at the end. The extra cost of waiting is trivial compared to the cost of acting on a biased result.
Does StatsIQ include tools for practicing A/B test design and analysis?
Yes. StatsIQ includes sample size calculators for proportions and means, power curve visualization tools, peeking simulation exercises that demonstrate false positive inflation, and scenario-based interpretation exercises where you must distinguish between statistical and practical significance.