
advanced · intermediate · 18-22 min

Bonferroni vs Holm vs FDR: Multiple Comparisons Correction Worked

When you run many hypothesis tests, some will be “significant” by pure chance. Here is exactly how Bonferroni, Holm-Bonferroni, and the Benjamini-Hochberg FDR procedure differ — with worked examples of each on the same 10 p-values.

What You'll Learn

✓Distinguish the family-wise error rate from the false discovery rate.
✓Apply Bonferroni, Holm-Bonferroni, and Benjamini-Hochberg corrections by hand.
✓Pick the right correction based on how costly false positives are in context.

1. Direct Answer: FWER vs FDR

When you run m independent hypothesis tests at α = 0.05 and every null is true, the chance of at least one false positive is 1 − (1 − 0.05)^m: about 40% with m = 10 and 64% with m = 20. Multiple-comparisons procedures rein this in. Family-wise error rate (FWER) procedures, such as Bonferroni and Holm-Bonferroni, control the probability of ANY false positive in the family of tests. Use them when even one false positive is costly: confirmatory clinical trials, regulatory submissions, primary endpoints. False discovery rate (FDR) procedures, especially Benjamini-Hochberg, control the EXPECTED PROPORTION of false positives among the tests you call significant. Use them in exploratory or high-throughput settings where many tests are expected and you can tolerate some false positives: genomics, A/B test screening, hypothesis-generating analyses. Bonferroni is the most conservative of the three. Holm is uniformly at least as powerful as Bonferroni at the same FWER. Benjamini-Hochberg is typically more powerful than either, but it controls a different (looser) quantity.
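The inflation figures quoted above can be reproduced with a few lines of Python. This is a minimal sketch that assumes independent tests with every null true; the function name `fwer_uncorrected` is illustrative, not part of any library.

```python
def fwer_uncorrected(m: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across m independent tests, all nulls true."""
    return 1 - (1 - alpha) ** m

for m in (10, 20):
    print(f"m = {m}: {fwer_uncorrected(m):.3f}")
# m = 10: 0.401
# m = 20: 0.642
```

If the tests are correlated the exact figure changes, but the qualitative point stands: the uncorrected error rate grows quickly with m.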

Key Points

•FWER: chance of ANY false positive. Bonferroni and Holm control it.
•FDR: expected PROPORTION of false positives among rejections. BH controls it.
•Power: Bonferroni ≤ Holm ≤ BH. The two FWER procedures give the stricter guarantee.

2. Bonferroni: The Sledgehammer

Reject H_i if p_i ≤ α / m, where m is the number of tests in the family. With m = 10 and α = 0.05, the per-test threshold is 0.005. The procedure controls FWER under any dependence structure and is trivially easy to apply, which is why it remains the default in regulated settings. Its weakness is conservativeness — by the time m hits 20 or 30 the threshold is so tight that real effects with moderate p-values are missed. The Bonferroni-corrected p-value is min(p_i × m, 1), which lets you report adjusted p-values that can be compared to the original α directly.
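As a rough illustration of the rule above, the sketch below applies Bonferroni to the ten p-values used in this guide's worked example. The function name `bonferroni` is hypothetical. The p-values are passed as strings and parsed with `fractions.Fraction`, an assumption made here so that the boundary case p = α / m is compared exactly rather than through floating-point rounding.

```python
from fractions import Fraction

# The ten p-values from this guide's worked example, as exact decimals.
P_VALUES = ["0.001", "0.005", "0.010", "0.022", "0.038",
            "0.045", "0.080", "0.180", "0.300", "0.610"]

def bonferroni(p_values, alpha="0.05"):
    """Return (reject flags, adjusted p-values), both in input order."""
    ps = [Fraction(p) for p in p_values]
    m = len(ps)
    threshold = Fraction(alpha) / m                 # same cutoff for every test
    reject = [p <= threshold for p in ps]
    adjusted = [float(min(p * m, 1)) for p in ps]   # min(p * m, 1)
    return reject, adjusted

reject, adjusted = bonferroni(P_VALUES)
print(sum(reject))    # 2
print(adjusted[:4])   # [0.01, 0.05, 0.1, 0.22]
```

Comparing an adjusted p-value to α gives the same decision as comparing the raw p-value to α / m.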

Key Points

•Threshold: α / m for every test.
•Adjusted p-value: min(p_i × m, 1).
•Controls FWER under any dependence — robust but conservative.

3. Holm-Bonferroni: The Same Guarantee, More Power

Sort the m p-values from smallest to largest: p_(1) ≤ p_(2) ≤ … ≤ p_(m). Compare them sequentially to α / m, α / (m − 1), α / (m − 2), …, α / 1. Reject H_(1) if p_(1) ≤ α / m; if so, test p_(2) at α / (m − 1); continue. The procedure STOPS at the first failure and retains (fails to reject) that null and every remaining one. Holm rejects everything Bonferroni rejects and sometimes more, because later tests get progressively looser thresholds, yet the FWER guarantee is identical and also holds under any dependence. The adjusted p-values are p_holm,(i) = max over j ≤ i of [(m − j + 1) × p_(j)], capped at 1, which guarantees monotonicity in i.
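The step-down logic and the running-maximum formula for adjusted p-values can be sketched as follows. This is an illustrative implementation under stated assumptions, not a reference one: the name `holm` is hypothetical, and p-values are parsed from strings with `Fraction` so threshold comparisons are exact.

```python
from fractions import Fraction

def holm(p_values, alpha="0.05"):
    """Holm step-down. Returns (reject flags, adjusted p-values) in input order."""
    ps = [Fraction(p) for p in p_values]
    m = len(ps)
    a = Fraction(alpha)
    order = sorted(range(m), key=lambda i: ps[i])   # indices, smallest p first

    reject = [False] * m
    adjusted = [0.0] * m
    still_rejecting = True
    running_max = Fraction(0)
    for rank, idx in enumerate(order):              # rank 0 is the smallest p
        multiplier = m - rank                       # m, m-1, ..., 1
        if still_rejecting and ps[idx] <= a / multiplier:
            reject[idx] = True
        else:
            still_rejecting = False                 # first failure ends all rejections
        running_max = max(running_max, multiplier * ps[idx])
        adjusted[idx] = float(min(running_max, 1))  # monotone, capped at 1
    return reject, adjusted

reject, adjusted = holm(["0.001", "0.005", "0.010", "0.022", "0.038",
                         "0.045", "0.080", "0.180", "0.300", "0.610"])
print(sum(reject))     # 2
print(adjusted[:5])    # [0.01, 0.045, 0.08, 0.154, 0.228]
```

The input does not need to be pre-sorted; results come back in the order the p-values were supplied.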

Key Points

•Sort ascending; test against α/m, α/(m−1), …, α/1 in order.
•Stop at first failure; retain (fail to reject) all remaining nulls.
•Same FWER guarantee as Bonferroni, never less power and often more.

4. Benjamini-Hochberg FDR: Power When You Have Many Tests

Sort p-values ascending: p_(1) ≤ … ≤ p_(m). For target FDR level q, find the largest k such that p_(k) ≤ (k / m) × q. Reject all H_(1), …, H_(k), including any whose p-value sits above its own rank's threshold. The procedure controls the EXPECTED PROPORTION of false discoveries among rejections at q under independence and positive dependence (Benjamini-Yekutieli extends to arbitrary dependence at a logarithmic cost). With m = 10 and q = 0.05, the threshold ramps from 0.005 for p_(1) up to 0.050 for p_(10), so a borderline p like 0.038 is still called significant when enough small p-values accompany it, for example when it is one of eight p-values at or below 0.040. This ramp is why BH is the standard in genomics (m in the thousands), where a Bonferroni threshold of α / m can leave few or no discoveries.
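A minimal sketch of the step-up rule, again with exact `Fraction` arithmetic. The function name `benjamini_hochberg` is hypothetical, and the adjusted p-value shown (a running minimum of p_(j) × m / j taken from the largest p downward) is the usual BH adjustment rather than something stated in this guide.

```python
from fractions import Fraction

def benjamini_hochberg(p_values, q="0.05"):
    """BH step-up. Returns (reject flags, adjusted p-values) in input order."""
    ps = [Fraction(p) for p in p_values]
    m = len(ps)
    level = Fraction(q)
    order = sorted(range(m), key=lambda i: ps[i])   # indices, smallest p first

    # Largest rank k (1-based) with p_(k) <= (k / m) * q; 0 means no rejections.
    k_max = 0
    for k, idx in enumerate(order, start=1):
        if ps[idx] <= Fraction(k, m) * level:
            k_max = k

    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True

    # Adjusted p-value: running minimum of p_(j) * m / j, from the largest p down.
    adjusted = [0.0] * m
    running_min = Fraction(1)
    for k in range(m, 0, -1):
        idx = order[k - 1]
        running_min = min(running_min, ps[idx] * m / k)
        adjusted[idx] = float(running_min)
    return reject, adjusted

reject, adjusted = benjamini_hochberg(["0.001", "0.005", "0.010", "0.022", "0.038",
                                       "0.045", "0.080", "0.180", "0.300", "0.610"])
print(sum(reject))                      # 3
print([round(a, 3) for a in adjusted])
# [0.01, 0.025, 0.033, 0.055, 0.075, 0.075, 0.114, 0.225, 0.333, 0.61]
```

A test is rejected at level q exactly when its adjusted p-value is at most q, which is why the first three adjusted values fall at or below 0.05 and the fourth (0.055) does not.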

Key Points

•Sort ascending; reject all up through the largest k with p_(k) ≤ (k/m) × q.
•Controls FDR ≈ expected false-positive proportion among rejections.
•Standard in high-throughput settings (genomics, A/B test screening).
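To give a feel for the "logarithmic cost" of the Benjamini-Yekutieli variant mentioned above, the sketch below computes the harmonic sum c(m) = 1 + 1/2 + … + 1/m by which BY divides each BH threshold. The helper name `by_factor` is an assumption made for this example.

```python
import math

def by_factor(m: int) -> float:
    """Harmonic sum 1 + 1/2 + ... + 1/m used by Benjamini-Yekutieli."""
    return sum(1 / i for i in range(1, m + 1))

q = 0.05
for m in (10, 1000):
    c = by_factor(m)
    approx = math.log(m) + 0.5772      # ln(m) plus the Euler-Mascheroni constant
    print(f"m = {m}: c(m) = {c:.3f}, ln(m) + 0.577 = {approx:.3f}, "
          f"effective level q / c(m) = {q / c:.4f}")
```

The effective level q / c(m) shrinks slowly as m grows, which is the price paid for validity under any dependence structure.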

5. Worked Example: Same 10 P-Values, Three Procedures

p-values sorted: 0.001, 0.005, 0.010, 0.022, 0.038, 0.045, 0.080, 0.180, 0.300, 0.610 (m = 10).

Bonferroni at α = 0.05: threshold = 0.05 / 10 = 0.005, so reject only p_(1) = 0.001 and p_(2) = 0.005 (exactly on the boundary, which counts because the rule is ≤). Adjusted p-values multiplied by 10: 0.01, 0.05, 0.10, 0.22, …. Two rejections.

Holm at α = 0.05: thresholds 0.005, 0.0056, 0.0063, 0.0071, 0.0083, 0.010, 0.0125, 0.0167, 0.025, 0.05. Compare in order. p_(1) = 0.001 ≤ 0.005 ✓; p_(2) = 0.005 ≤ 0.0056 ✓; p_(3) = 0.010 > 0.0063 ✗ → stop. Two rejections, same as Bonferroni here.

Benjamini-Hochberg at q = 0.05: thresholds = (k/10) × 0.05 = 0.005, 0.010, 0.015, 0.020, 0.025, 0.030, 0.035, 0.040, 0.045, 0.050. Walk down from the largest p-value: p_(10) = 0.610 > 0.050 ✗; p_(9) = 0.300 > 0.045 ✗; p_(8) = 0.180 > 0.040 ✗; p_(7) = 0.080 > 0.035 ✗; p_(6) = 0.045 > 0.030 ✗; p_(5) = 0.038 > 0.025 ✗; p_(4) = 0.022 > 0.020 ✗; p_(3) = 0.010 ≤ 0.015 ✓. Largest k satisfying the rule is k = 3, so reject H_(1), H_(2), H_(3): three rejections, picking up p = 0.010, which Bonferroni and Holm both failed to reject.
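The three rejection counts in this worked example can be checked with a compact script. This is a sketch only: it assumes the p-values are already sorted ascending, uses q = α = 0.05 for BH, and relies on `Fraction` so that the boundary comparison 0.005 ≤ 0.05 / 10 is exact.

```python
from fractions import Fraction

ALPHA = Fraction("0.05")
P = [Fraction(s) for s in ("0.001", "0.005", "0.010", "0.022", "0.038",
                           "0.045", "0.080", "0.180", "0.300", "0.610")]  # sorted
m = len(P)

# Bonferroni: one threshold, alpha / m, for every test.
n_bonferroni = sum(p <= ALPHA / m for p in P)

# Holm: step through alpha/m, alpha/(m-1), ...; stop at the first failure.
n_holm = 0
for i, p in enumerate(P):              # i = 0 for the smallest p
    if p > ALPHA / (m - i):
        break
    n_holm += 1

# Benjamini-Hochberg: largest k with p_(k) <= (k / m) * q, here q = ALPHA.
n_bh = max((k for k, p in enumerate(P, start=1) if p <= Fraction(k, m) * ALPHA),
           default=0)

print(n_bonferroni, n_holm, n_bh)      # 2 2 3
```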

Key Points

•Same data: Bonferroni 2, Holm 2, FDR 3 rejections.
•BH walks DOWN from the largest p; the largest passing k captures all smaller ones.
•The gap between FWER and FDR rejection counts typically widens as m grows.

6. Picking the Right Procedure

Confirmatory tests with regulatory or clinical-decision stakes: Bonferroni or Holm applied to the protocol-specified family. Primary and secondary endpoints in trials typically use Holm or a fixed-sequence hierarchical procedure. Exploratory screening of hundreds or thousands of variables (genome-wide association, drug-screening hits, multivariate A/B tests): BH FDR at q = 0.05 or 0.10. Pilot studies and hypothesis-generating analyses: BH, because you will validate hits in a follow-up. If you want a single number to report for each test, give adjusted p-values: Bonferroni's min(p × m, 1), Holm's step-down values, or BH-adjusted p-values (often labeled q-values). A raw p-value on its own is hard to interpret once more than one test has been run, so state the family size and the correction alongside it.

Key Points

•Confirmatory + regulated: Bonferroni or Holm.
•Exploratory + high-throughput: BH FDR.
•Always report adjusted p-values, not raw p, after correction.

7. Running the Correction in StatsIQ

Paste or photograph the family of raw p-values and StatsIQ applies Bonferroni, Holm-Bonferroni, and Benjamini-Hochberg side by side, returning the adjusted p-values and the per-procedure rejection set at your chosen α or q. It flags families where the procedures disagree so you can choose between FWER and FDR for the context. This content is for educational purposes only.

Key Points

•Three procedures applied side by side; rejection set per procedure.
•Adjusted p-values returned, ready to drop into a manuscript.
•Disagreement between FWER and FDR is flagged for context-based choice.

Key Takeaways

★FWER controls P(any false positive); FDR controls expected proportion of false positives among rejections.
★Bonferroni: reject if p ≤ α/m. Conservative; works under any dependence.
★Holm: sort ascending, test against α/m, α/(m−1), …, α/1; stop at first failure.
★Benjamini-Hochberg: sort ascending, find largest k with p_(k) ≤ (k/m)q; reject all 1..k.
★Adjusted p-values let you keep the α = 0.05 mental model after correction.

Practice Questions

1. You run 20 independent tests at α = 0.05 with no correction. What is the probability of at least one false positive if all 20 nulls are true?

1 − (1 − 0.05)^20 = 1 − 0.358 = 0.642, or about 64%. The probability of zero false positives across 20 independent tests at α = 0.05 each is 0.358, so the complementary probability of at least one is 0.642.

2. Your sorted p-values are 0.002, 0.012, 0.030, 0.040, 0.080 (m = 5). Which are rejected by Holm at α = 0.05?

Thresholds 0.010, 0.0125, 0.0167, 0.025, 0.05. p_(1) = 0.002 ≤ 0.010 ✓; p_(2) = 0.012 ≤ 0.0125 ✓; p_(3) = 0.030 > 0.0167 ✗ → stop. Two rejections.

3. Same p-values. What does Benjamini-Hochberg at q = 0.05 reject?

Thresholds (k/5) × 0.05 = 0.010, 0.020, 0.030, 0.040, 0.050. Walk down: p_(5) = 0.080 > 0.050 ✗; p_(4) = 0.040 ≤ 0.040 ✓. Largest k = 4, so reject p_(1) through p_(4) — four rejections.


FAQs

Common questions about this topic

Are q-values and adjusted p-values the same thing?

They are closely related but not identical. The BH-adjusted p-value for a given test is the smallest q at which that test would be rejected under BH; it is interpreted on the FDR scale. The Storey q-value uses an estimate of the proportion of true nulls (π₀) to adapt the threshold and gains power when a substantial share of the nulls are false (π₀ well below 1); when nearly all nulls are true it is close to BH. Most software reports adjusted p-values labeled as q-values in the BH sense.

When does the Benjamini-Hochberg FDR procedure fail?

Under arbitrary dependence among the test statistics, including some forms of negative dependence, BH is not guaranteed to control FDR. The Benjamini-Yekutieli variant divides each threshold by Σ(1/i) for i = 1, …, m (about ln(m) + 0.577) to control FDR under any dependence at the cost of substantial conservatism. In genomic and most applied settings, the dependence is approximately positive and BH is generally considered safe.

Should I correct across primary AND secondary endpoints in a clinical trial?

Yes, when the secondary endpoints are tested for confirmatory claims. The regulatory expectation is a pre-specified multiplicity strategy (Holm, fixed-sequence, gatekeeping, or a graphical Bonferroni-Holm approach) covering the family. Exploratory analyses can be reported without correction provided they are clearly labeled exploratory and not used for label claims.

Do I correct for the number of tests I RAN or the number I planned to run?

The number you ran from the same conceptual family. Defining the family is the hard part: a single trial’s endpoints, the contrasts within an ANOVA, the loci screened in a GWAS. Garden-of-forking-paths multiplicity (different model specifications, subsetting, transformations) is rarely corrected for explicitly but should be acknowledged. Pre-registration removes much of the ambiguity.

Can StatsIQ choose between FWER and FDR for me?

It will not choose for you because the choice depends on stakes you set, but it will run all three procedures side by side and flag where they disagree so you can pick the one that matches the confirmatory-versus-exploratory nature of your analysis. This content is for educational purposes only.

Related Study Guides

Hypothesis Testing: The Complete Guide With 6 Worked Tests

Type I vs Type II Errors: Worked Examples and Tradeoffs

P-Value Interpretation: Common Mistakes and Correct Reading

A/B Testing Done Right: Experiment Design, Sample Size, and Avoiding False Discoveries

Tukey HSD Post-Hoc Test After ANOVA: Worked Examples