Bonferroni vs Holm vs FDR: Multiple Comparisons Correction Worked Examples
When you run many hypothesis tests, some will be “significant” by pure chance. Here is exactly how Bonferroni, Holm-Bonferroni, and the Benjamini-Hochberg FDR procedure differ — with worked examples of each on the same 10 p-values.
What You'll Learn
- ✓Distinguish the family-wise error rate from the false discovery rate.
- ✓Apply Bonferroni, Holm-Bonferroni, and Benjamini-Hochberg corrections by hand.
- ✓Pick the right correction based on how costly false positives are in context.
1. Direct Answer: FWER vs FDR
When you run m independent hypothesis tests at α = 0.05 and every null is true, the chance of at least one false positive is 1 − (1 − 0.05)^m: about 40% with m = 10 and 64% with m = 20. Multiple-comparisons procedures rein this in. Family-wise error rate (FWER) procedures, such as Bonferroni and Holm-Bonferroni, control the probability of ANY false positive in the family of tests. Use them when even one false positive is costly: confirmatory clinical trials, regulatory submissions, primary endpoints. False discovery rate (FDR) procedures, especially Benjamini-Hochberg, control the EXPECTED PROPORTION of false positives among the tests you call significant. Use them in exploratory or high-throughput settings where many tests are expected and you can tolerate some false positives: genomics, A/B test screening, hypothesis-generating analyses. Bonferroni is the most conservative. Holm is uniformly more powerful than Bonferroni at the same FWER (it rejects everything Bonferroni rejects, and sometimes more). Benjamini-Hochberg is usually more powerful than either but controls a different (looser) quantity.
Key Points
- •FWER: chance of ANY false positive. Bonferroni and Holm control it.
- •FDR: expected PROPORTION of false positives among rejections. BH controls it.
- •Power at the same level: Bonferroni ≤ Holm ≤ BH. Guarantee: FWER control (Bonferroni, Holm) is stricter than FDR control (BH).
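The inflation formula quoted above is easy to check numerically. The following is a minimal sketch, assuming independent tests and that every null hypothesis is true; the function name `fwer_uncorrected` is illustrative and not taken from any library.

```python
def fwer_uncorrected(m: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across m independent tests
    when every null is true and no correction is applied."""
    return 1.0 - (1.0 - alpha) ** m


for m in (1, 10, 20):
    print(f"m = {m:>2}: FWER = {fwer_uncorrected(m):.3f}")
# m =  1: FWER = 0.050
# m = 10: FWER = 0.401
# m = 20: FWER = 0.642
```

Under dependence between tests the exact figure differs, so treat these numbers as the independent-tests reference case.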
2. Bonferroni: The Sledgehammer
Reject H_i if p_i ≤ α / m, where m is the number of tests in the family. With m = 10 and α = 0.05, the per-test threshold is 0.005. The procedure controls FWER under any dependence structure and is trivially easy to apply, which is why it remains the default in regulated settings. Its weakness is conservativeness — by the time m hits 20 or 30 the threshold is so tight that real effects with moderate p-values are missed. The Bonferroni-corrected p-value is min(p_i × m, 1), which lets you report adjusted p-values that can be compared to the original α directly.
Key Points
- •Threshold: α / m for every test.
- •Adjusted p-value: min(p_i × m, 1).
- •Controls FWER under any dependence — robust but conservative.
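As a rough illustration of the rule above, here is a small pure-Python sketch applied to the ten p-values from the worked example in section 5. The function name `bonferroni` and the tolerance `tol` are assumptions of this sketch; the tolerance only guards against floating-point rounding when a p-value sits exactly on the threshold.

```python
def bonferroni(pvalues, alpha=0.05, tol=1e-12):
    """Return (adjusted p-values, reject flags) in the input order."""
    m = len(pvalues)
    # Adjusted p-value: min(p_i * m, 1)
    adjusted = [min(p * m, 1.0) for p in pvalues]
    # Comparing the adjusted value to alpha is the same as p_i <= alpha / m
    reject = [p_adj <= alpha + tol for p_adj in adjusted]
    return adjusted, reject


pvals = [0.001, 0.005, 0.010, 0.022, 0.038, 0.045, 0.080, 0.180, 0.300, 0.610]
bonf_adjusted, bonf_reject = bonferroni(pvals)
print([round(a, 3) for a in bonf_adjusted])
print(sum(bonf_reject), "rejections")
```

On this family the first two hypotheses are rejected, and every adjusted value at or above 1 is capped at 1.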
3. Holm-Bonferroni: The Same Guarantee, More Power
Sort the m p-values from smallest to largest: p_(1) ≤ p_(2) ≤ … ≤ p_(m). Compare them sequentially to α / m, α / (m − 1), α / (m − 2), …, α / 1. Reject H_(1) if p_(1) ≤ α / m; if so, test p_(2) at α / (m − 1); continue. The procedure STOPS at the first failure and retains (fails to reject) that null and all remaining ones, even if a later p-value would have passed its own threshold. Holm is uniformly more powerful than Bonferroni at the same FWER: its first threshold equals Bonferroni's and every later one is looser, so it rejects everything Bonferroni rejects and sometimes more. Like Bonferroni, it holds under any dependence structure. The adjusted p-values are p_holm,(i) = max over j ≤ i of [(m − j + 1) × p_(j)], capped at 1; taking the running maximum guarantees monotonicity in i.
Key Points
- •Sort ascending; test against α/m, α/(m−1), …, α/1 in order.
- •Stop at first failure; fail to reject that null and all remaining ones.
- •Same FWER guarantee as Bonferroni, never less power and often more.
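The step-down logic and the adjusted p-value formula can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference implementation: the name `holm` and the tolerance `tol` are assumptions, and the input is the set of ten p-values from the worked example in section 5.

```python
def holm(pvalues, alpha=0.05, tol=1e-12):
    """Return (adjusted p-values, reject flags) in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, ascending p

    # Step-down test: thresholds alpha/m, alpha/(m-1), ..., alpha/1
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha / (m - rank + 1) + tol:
            reject[idx] = True
        else:
            break  # first failure: retain this and all remaining nulls

    # Adjusted p-values: running max of (m - j + 1) * p_(j), capped at 1
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order, start=1):
        running_max = max(running_max, (m - rank + 1) * pvalues[idx])
        adjusted[idx] = min(running_max, 1.0)
    return adjusted, reject


pvals = [0.001, 0.005, 0.010, 0.022, 0.038, 0.045, 0.080, 0.180, 0.300, 0.610]
holm_adjusted, holm_reject = holm(pvals)
print([round(a, 3) for a in holm_adjusted])
print(sum(holm_reject), "rejections")
```

The running maximum matters at the sixth p-value: 5 × 0.045 = 0.225 is smaller than the previous 6 × 0.038 = 0.228, so the adjusted value stays at 0.228 to keep the sequence monotone.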
4. Benjamini-Hochberg FDR: Power When You Have Many Tests
Sort p-values ascending: p_(1) ≤ … ≤ p_(m). For target FDR level q, find the largest k such that p_(k) ≤ (k / m) × q. Reject all H_(1), …, H_(k), including any whose own p-value exceeds its own threshold. The procedure controls the EXPECTED PROPORTION of false discoveries among rejections at q under independence and positive dependence (Benjamini-Yekutieli extends to arbitrary dependence at a logarithmic cost). With m = 10 and q = 0.05, the threshold ramps from 0.005 for p_(1) up to 0.050 for p_(10), so a borderline p like 0.038 is still rejected when it sits at rank 8 or later (threshold 0.040 or above), or when any larger p-value passes its own threshold. This ramp is why BH is the standard in genomics (m in the thousands), where a Bonferroni threshold of α / m would leave few or no discoveries.
Key Points
- •Sort ascending; reject all up through the largest k with p_(k) ≤ (k/m) × q.
- •Controls FDR ≈ expected false-positive proportion among rejections.
- •Standard in high-throughput settings (genomics, A/B test screening).
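The "largest k" rule can be sketched as follows. The function name `benjamini_hochberg` and the tolerance `tol` are assumptions of this sketch. The first call uses the ten p-values from the worked example in section 5; the second uses a small hypothetical family, invented here purely to show the step-up behaviour.

```python
def benjamini_hochberg(pvalues, q=0.05, tol=1e-12):
    """Return (k, reject flags in input order) for the BH step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, ascending p

    # Find the largest rank k with p_(k) <= (k / m) * q
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * q + tol:
            k = rank

    # Reject every hypothesis ranked 1..k, whether or not it passed on its own
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return k, reject


pvals = [0.001, 0.005, 0.010, 0.022, 0.038, 0.045, 0.080, 0.180, 0.300, 0.610]
k, bh_reject = benjamini_hochberg(pvals)
print(k, "rejections")

# Hypothetical family: 0.02 fails its own threshold (0.0125) but is still
# rejected because the third-ranked p-value passes (0.03 <= 0.0375).
k_demo, demo_reject = benjamini_hochberg([0.02, 0.021, 0.03, 0.5])
print(k_demo, demo_reject)
```

Scanning all ranks and keeping the last one that passes is equivalent to walking down from the largest p-value and stopping at the first pass.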
5. Worked Example: Same 10 P-Values, Three Procedures
Sorted p-values: 0.001, 0.005, 0.010, 0.022, 0.038, 0.045, 0.080, 0.180, 0.300, 0.610 (m = 10).
Bonferroni at α = 0.05: threshold = 0.05 / 10 = 0.005, so reject only p_(1) = 0.001 and p_(2) = 0.005 (exactly on the boundary, which counts because the rule is ≤). Adjusted p-values are the raw values multiplied by 10: 0.01, 0.05, 0.10, 0.22, …. Two rejections.
Holm at α = 0.05: thresholds α / (m − i + 1) = 0.005, 0.0056, 0.0063, 0.0071, 0.0083, 0.010, 0.0125, 0.0167, 0.025, 0.05. Compare in order. p_(1) = 0.001 ≤ 0.005 ✓; p_(2) = 0.005 ≤ 0.0056 ✓; p_(3) = 0.010 > 0.0063 ✗ → stop. Two rejections, same as Bonferroni here.
Benjamini-Hochberg at q = 0.05: thresholds = (k/10) × 0.05 = 0.005, 0.010, 0.015, 0.020, 0.025, 0.030, 0.035, 0.040, 0.045, 0.050. Walk down from the largest p-value: p_(10) = 0.610 > 0.050 ✗; p_(9) = 0.300 > 0.045 ✗; p_(8) = 0.180 > 0.040 ✗; p_(7) = 0.080 > 0.035 ✗; p_(6) = 0.045 > 0.030 ✗; p_(5) = 0.038 > 0.025 ✗; p_(4) = 0.022 > 0.020 ✗; p_(3) = 0.010 ≤ 0.015 ✓. The largest k satisfying the rule is k = 3, so reject H_(1), H_(2), H_(3): three rejections, picking up p = 0.010, which both Bonferroni and Holm failed to reject.
Key Points
- •Same data: Bonferroni 2, Holm 2, FDR 3 rejections.
- •BH walks DOWN from the largest p; the largest passing k captures all smaller ones.
- •The gap between FWER and FDR widens dramatically with larger m.
6. Picking the Right Procedure
For confirmatory tests with regulatory or clinical-decision stakes, use Bonferroni or Holm over the protocol-specified family. Primary and secondary endpoints in trials almost always use a Holm or fixed-sequence hierarchical procedure. For exploratory screening of hundreds or thousands of variables (genome-wide association, drug-screening hits, multivariate A/B tests), use BH FDR at q = 0.05 or 0.10. For pilot studies and hypothesis-generating analyses, use BH because you will validate hits in a follow-up. If you want a single number to report per test, give adjusted p-values: Bonferroni's p × m, Holm's step-down values, or BH-adjusted q-values. A raw p-value on its own is hard to interpret once you have run more than one test, because it says nothing about how many comparisons it was selected from.
Key Points
- •Confirmatory + regulated: Bonferroni or Holm.
- •Exploratory + high-throughput: BH FDR.
- •Always report adjusted p-values, not raw p, after correction.
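Bonferroni and Holm adjusted p-values have explicit formulas in the sections above. The BH-adjusted value is commonly computed as the running minimum of p_(j) × m / j taken from the largest p-value downward, which gives the smallest q at which each test would be rejected. The sketch below assumes that standard definition; the name `bh_adjusted` is illustrative, and the input is the ten p-values from the worked example in section 5.

```python
def bh_adjusted(pvalues):
    """BH-adjusted p-values (q-values in the BH sense), in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0  # also caps every adjusted value at 1
    # Walk from the largest p-value (rank m) down to the smallest (rank 1)
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


pvals = [0.001, 0.005, 0.010, 0.022, 0.038, 0.045, 0.080, 0.180, 0.300, 0.610]
q_values = bh_adjusted(pvals)
print([round(v, 4) for v in q_values])
print(sum(v <= 0.05 for v in q_values), "tests with BH-adjusted p <= 0.05")
```

Comparing these values to q = 0.05 reproduces the three BH rejections from the worked example, which is what makes adjusted values convenient to report.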
7. Running the Correction in StatsIQ
Paste or photograph the family of raw p-values and StatsIQ applies Bonferroni, Holm-Bonferroni, and Benjamini-Hochberg side by side, returning the adjusted p-values and the per-procedure rejection set at your chosen α or q. It flags families where the procedures disagree so you can choose between FWER and FDR for the context. This content is for educational purposes only.
Key Points
- •Three procedures applied side by side; rejection set per procedure.
- •Adjusted p-values returned, ready to drop into a manuscript.
- •Disagreement between FWER and FDR is flagged for context-based choice.
Key Takeaways
- ★FWER controls P(any false positive); FDR controls expected proportion of false positives among rejections.
- ★Bonferroni: reject if p ≤ α/m. Conservative; works under any dependence.
- ★Holm: sort ascending, test against α/m, α/(m−1), …, α/1; stop at first failure.
- ★Benjamini-Hochberg: sort ascending, find largest k with p_(k) ≤ (k/m)q; reject all 1..k.
- ★Adjusted p-values let you keep the α = 0.05 mental model after correction.
Practice Questions
1. You run 20 tests at α = 0.05 with no correction. What is the probability of at least one false positive if all 20 nulls are true?
2. Your sorted p-values are 0.002, 0.012, 0.030, 0.040, 0.080 (m = 5). Which are rejected by Holm at α = 0.05?
3. Same p-values. What does Benjamini-Hochberg at q = 0.05 reject?
FAQs
Common questions about this topic
BH-adjusted p-values and q-values are closely related but not identical. The BH-adjusted p-value for a given test is the smallest q at which that test would be rejected under BH; it is interpreted on the FDR scale. The Storey q-value uses an estimate of the proportion of true nulls (π₀) to adapt the threshold, which makes it more powerful when a sizeable share of the nulls are false (π₀ well below 1); when nearly all nulls are true it gives essentially the BH result. Most software reports adjusted p-values labeled as q-values in the BH sense.
Under arbitrary dependence among the test statistics, negative dependence in particular, BH is not guaranteed to control FDR. The Benjamini-Yekutieli variant divides each threshold by Σ(1/i) for i = 1, …, m (about ln(m) + 0.577) to control FDR under any dependence at the cost of substantial conservatism. In genomic and most applied settings, the dependence is approximately positive and BH is generally considered safe.
Secondary endpoints do need correction when they are tested for confirmatory claims. The regulatory expectation is a pre-specified multiplicity strategy covering the family: Holm, fixed-sequence, gatekeeping, or a graphical Bonferroni-Holm approach. Exploratory analyses can be reported without correction provided they are clearly labeled exploratory and not used for label claims.
The family size m is the number of tests you ran from the same conceptual family. Defining the family is the hard part: a single trial’s endpoints, the contrasts within an ANOVA, the loci screened in a GWAS. Garden-of-forking-paths multiplicity (different model specifications, subsetting, transformations) is rarely corrected for explicitly but should be acknowledged. Pre-registration removes much of the ambiguity.
StatsIQ will not choose a procedure for you, because the choice depends on stakes you set, but it will run all three procedures side by side and flag where they disagree so you can pick the one that matches the confirmatory-versus-exploratory nature of your analysis.