Advanced · Intermediate · 20-25 min

Tukey HSD Post-Hoc Test After ANOVA: Worked Examples

A significant ANOVA tells you the group means differ — Tukey’s HSD tells you which pairs. Here is the studentized range, the HSD formula, a full comparison table, and how Tukey stacks up against Bonferroni, Scheffé, and Dunnett.

What You'll Learn

  • Explain why post-hoc control is needed after a significant ANOVA.
  • Compute Tukey’s HSD and build a pairwise comparison table by hand.
  • Choose Tukey, Bonferroni, Scheffé, or Dunnett for the right situation.

1. Direct Answer: What Tukey HSD Does

After a one-way ANOVA returns a significant F, you know that at least one pair of group means differs — but not which. Tukey’s Honestly Significant Difference (HSD) test compares EVERY pair of means while holding the family-wise error rate (the chance of any false positive across all comparisons) at your chosen α. It does this with the studentized range distribution, which models the gap between the largest and smallest of several means drawn from the same population. For each pair you compute the mean difference and compare it to a single critical distance, HSD; if the difference exceeds HSD, that pair is significantly different. Tukey is the standard choice when you want ALL pairwise comparisons among groups of roughly equal size.

Key Points

  • Tukey HSD identifies which specific pairs of means differ after a significant ANOVA.
  • It controls the family-wise error rate across all pairwise comparisons.
  • Best when you want every pairwise comparison and group sizes are similar.

2. Why Not Just Run Multiple t-Tests

Each t-test carries its own α — say 0.05. Run several and the chance of at least one false positive balloons. With k groups there are k(k−1)/2 pairs; for 4 groups that is 6 comparisons. If each runs at α = 0.05 independently, the family-wise error rate is roughly 1 − (1 − 0.05)^6 = 0.265 — a 27% chance of a spurious "significant" pair even when nothing differs. By 5 groups (10 comparisons) it is about 40%. Tukey’s HSD keeps the overall rate at 5% by widening the bar a single comparison must clear, using the studentized range rather than the ordinary t. That is the whole point of a post-hoc correction: many looks, one error budget.

Key Points

  • k groups produce k(k−1)/2 pairwise comparisons.
  • Uncorrected, family-wise error ≈ 1 − (1 − α)^c — about 27% for 6 comparisons.
  • Tukey spends one 5% error budget across all comparisons.
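
The inflation described above is easy to check numerically. The following is a minimal sketch, assuming the comparisons are independent (the same simplification the formula makes); the function names are illustrative and not part of any library.

```python
from math import comb


def pairwise_count(k: int) -> int:
    """Number of distinct pairs among k groups: k(k-1)/2."""
    return comb(k, 2)


def uncorrected_fwer(alpha: float, comparisons: int) -> float:
    """Approximate family-wise error rate, treating the tests as independent."""
    return 1 - (1 - alpha) ** comparisons


# Family-wise error for 3 to 6 groups when every pair is tested at alpha = 0.05
for k in (3, 4, 5, 6):
    c = pairwise_count(k)
    print(f"k={k}: {c} pairs, FWER ≈ {uncorrected_fwer(0.05, c):.3f}")
```

For k = 4 this prints a rate of about 0.265, and for k = 6 about 0.537, matching the hand calculations.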

3. The Studentized Range and the HSD Formula

The critical value comes from the studentized range distribution q, which depends on α, the number of groups k, and the within-group degrees of freedom df_within = N − k from the ANOVA. The honestly significant difference is HSD = q(α, k, df_within) × √(MS_within / n), where MS_within is the mean square error from the ANOVA table and n is the per-group sample size. Any pair whose absolute mean difference exceeds HSD is significant at the family-wise α. For UNEQUAL group sizes, use the Tukey-Kramer modification, which replaces √(MS_within/n) with √[(MS_within/2)(1/n_i + 1/n_j)] for each pair, so each pair gets its own critical distance. The MS_within you plug in is exactly the denominator of your ANOVA F ratio: Tukey reuses the pooled error estimate.

Key Points

  • HSD = q(α, k, df_within) × √(MS_within / n).
  • q comes from the studentized range table, indexed by α, k, and df_within.
  • Unequal n → Tukey-Kramer with √[(MS_within/2)(1/n_i + 1/n_j)].
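
Both formulas can be sketched in a few lines. This is an illustration rather than a reference implementation: q is assumed to be looked up in a studentized range table and passed in by the caller, the function names are hypothetical, and the input values are chosen only for demonstration. In a real unbalanced design, df_within (and therefore q) would come from that design's own ANOVA.

```python
from math import sqrt


def hsd(q: float, ms_within: float, n: int) -> float:
    """Equal-n critical distance: q * sqrt(MS_within / n)."""
    return q * sqrt(ms_within / n)


def tukey_kramer(q: float, ms_within: float, n_i: int, n_j: int) -> float:
    """Per-pair critical distance for unequal group sizes."""
    return q * sqrt((ms_within / 2) * (1 / n_i + 1 / n_j))


# Illustrative inputs; q is held fixed here purely for demonstration
q, ms_within = 3.96, 9.0

print(round(hsd(q, ms_within, 6), 2))              # balanced design, n = 6
print(round(tukey_kramer(q, ms_within, 6, 6), 2))  # reduces to hsd() when sizes match
print(round(tukey_kramer(q, ms_within, 4, 8), 2))  # unbalanced pair gets a wider bar
```

When n_i = n_j the Tukey-Kramer expression collapses to the ordinary HSD, which is a quick sanity check on any implementation.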

4. Worked Example: Four Fertilizers

Four fertilizers, n = 6 plants each (N = 24), one-way ANOVA already significant with MS_within = 9.0 and df_within = 20. Group mean yields: A = 18, B = 22, C = 24, D = 28. First the critical q: for α = 0.05, k = 4, df = 20, the studentized range value is q ≈ 3.96. HSD = 3.96 × √(9.0 / 6) = 3.96 × √1.5 = 3.96 × 1.225 = 4.85. Now compare every pair against 4.85:

  • |A−B| = 4 (not significant)
  • |A−C| = 6 (significant)
  • |A−D| = 10 (significant)
  • |B−C| = 2 (not significant)
  • |B−D| = 6 (significant)
  • |C−D| = 4 (not significant)

Conclusion: A differs from C and D, and B differs from D. The adjacent pairs (A vs B, B vs C, C vs D) cannot be separated. In a compact letter display, A and B share a group, B and C share a group, and C and D share a group (A: a, B: ab, C: bc, D: c). D has the highest mean but is not significantly above C; means within 4.85 of each other are statistically indistinguishable.

Key Points

  • Look up q for (α, k, df_within); here q ≈ 3.96.
  • HSD = 3.96 × √(9/6) = 4.85 — the single bar every pair must clear.
  • Compare each |mean difference| to HSD to flag significant pairs.
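
The comparison table for the fertilizer example can be reproduced with a short script. This is a sketch under the example's own numbers; q = 3.96 is entered as a table value rather than computed, and the variable names are illustrative.

```python
from itertools import combinations
from math import sqrt

means = {"A": 18, "B": 22, "C": 24, "D": 28}  # group mean yields from the example
q, ms_within, n = 3.96, 9.0, 6                # q taken from a studentized range table

hsd = q * sqrt(ms_within / n)

# One row per pair: (group 1, group 2, |mean difference|, exceeds HSD?)
rows = []
for (g1, m1), (g2, m2) in combinations(means.items(), 2):
    diff = abs(m1 - m2)
    rows.append((g1, g2, diff, diff > hsd))

print(f"HSD = {hsd:.2f}")
for g1, g2, diff, sig in rows:
    verdict = "significant" if sig else "not significant"
    print(f"{g1} vs {g2}: |diff| = {diff:>2}  {verdict}")

significant_pairs = [(g1, g2) for g1, g2, _, sig in rows if sig]
```

Only A vs C, A vs D, and B vs D clear the bar; every adjacent pair, including C vs D, falls short of it.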

5. Confidence Intervals for the Differences

Tukey also yields simultaneous confidence intervals: for each pair, (mean_i − mean_j) ± HSD. These intervals hold jointly at the family-wise confidence level (95% for α = 0.05), unlike a stack of independent t-intervals. Using the example, the A−D difference is −10 ± 4.85 = (−14.85, −5.15) — it excludes zero, confirming significance, and tells you the plausible SIZE of the gap, not just that one exists. Reporting these intervals is better practice than reporting only which pairs were flagged, because it communicates magnitude and precision together.

Key Points

  • Simultaneous CI for each pair: (mean_i − mean_j) ± HSD.
  • Intervals hold jointly at the family-wise confidence level.
  • An interval excluding zero is the same verdict as |difference| > HSD.
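
The simultaneous intervals follow directly from the same HSD margin. A minimal sketch, assuming the fertilizer example's values (means 18, 22, 24, 28; MS_within = 9.0; n = 6; table value q = 3.96) and illustrative variable names:

```python
from itertools import combinations
from math import sqrt

means = {"A": 18, "B": 22, "C": 24, "D": 28}
q, ms_within, n = 3.96, 9.0, 6   # q taken from a studentized range table
margin = q * sqrt(ms_within / n)  # the HSD, used as the half-width of every interval

# Interval for each ordered pair (first mean minus second mean)
intervals = {}
for (g1, m1), (g2, m2) in combinations(means.items(), 2):
    diff = m1 - m2
    intervals[(g1, g2)] = (diff - margin, diff + margin)

for (g1, g2), (lo, hi) in intervals.items():
    flag = "excludes 0" if lo > 0 or hi < 0 else "contains 0"
    print(f"{g1} - {g2}: ({lo:6.2f}, {hi:6.2f})  {flag}")
```

The pairs whose intervals exclude zero are exactly the pairs whose absolute difference exceeds HSD, so the two ways of reporting never disagree.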

6. Tukey vs Bonferroni vs Scheffé vs Dunnett

Tukey HSD: all pairwise comparisons, equal-ish n, most powerful for that job. Bonferroni: divide α by the number of comparisons. It is simple and general, but conservative (low power) once you have many comparisons; fine for a small pre-planned set. Scheffé: the most conservative, but it covers ALL possible contrasts, not just pairwise ones (e.g., comparing the average of two groups to a third); use it when you explore complex contrasts. Dunnett: specialized for comparing several treatment groups to a SINGLE control and nothing else. It is more powerful than Tukey for that narrower question because it makes fewer comparisons. Pick the procedure that matches the comparisons you actually intend to make; using Tukey when you only care about treatment-versus-control comparisons wastes power.

Key Points

  • Tukey: all pairwise, equal n — the default workhorse.
  • Bonferroni: simple, conservative, good for a few planned comparisons.
  • Scheffé: covers all contrasts (most conservative); Dunnett: treatments vs one control.
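
Part of the difference between these procedures is simply how many comparisons each family contains. The sketch below counts them and shows the per-test α that Bonferroni would assign to each family; it assumes one of the k groups acts as the control, and the function name and dictionary keys are hypothetical.

```python
from math import comb


def family_sizes(k: int, alpha: float = 0.05) -> dict:
    """Comparison counts and Bonferroni per-test alpha for k groups.

    Assumes one of the k groups is the control in the versus-control family.
    """
    all_pairs = comb(k, 2)   # the family Tukey covers
    vs_control = k - 1       # the family Dunnett covers
    return {
        "all_pairs": all_pairs,
        "vs_control": vs_control,
        "bonferroni_alpha_all_pairs": alpha / all_pairs,
        "bonferroni_alpha_vs_control": alpha / vs_control,
    }


for name, value in family_sizes(4).items():
    print(f"{name}: {value:.4g}")
```

With four groups, the all-pairs family has six comparisons and the versus-control family has three, so a procedure restricted to the smaller family has less to correct for.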

7. Running Post-Hoc Tests in StatsIQ

Snap a photo of an ANOVA table or the raw grouped data and StatsIQ runs the omnibus F, and when it is significant, produces the Tukey HSD comparison table — looking up q for your k and df_within, computing HSD (or Tukey-Kramer for unequal groups), flagging each significant pair, and giving the simultaneous confidence intervals. It will also suggest Dunnett if your design is treatments-versus-control. This content is for educational purposes only.

Key Points

  • Omnibus F first, then an automatic Tukey HSD table when significant.
  • q lookup, HSD, and simultaneous CIs all shown.
  • Suggests Dunnett for treatment-vs-control designs.

Key Takeaways

  • Tukey HSD = q(α, k, df_within) × √(MS_within / n); reuses the ANOVA’s pooled error.
  • It controls family-wise error across all k(k−1)/2 pairwise comparisons.
  • Uncorrected multiple t-tests inflate error to ≈ 1 − (1 − α)^c.
  • Unequal group sizes → Tukey-Kramer modification.
  • Dunnett for treatments-vs-control; Scheffé for arbitrary contrasts; Bonferroni for a few planned comparisons.

Practice Questions

1. ANOVA with 3 groups (n = 5 each) has MS_within = 8 and df_within = 12. With q = 3.77, what is HSD?
HSD = 3.77 × √(8/5) = 3.77 × √1.6 = 3.77 × 1.265 = 4.77. Any pair of means differing by more than 4.77 is significant at the family-wise α = 0.05.
2. You have 6 groups. How many pairwise comparisons, and what is the uncorrected family-wise error at α = 0.05?
6×5/2 = 15 comparisons. Uncorrected family-wise error ≈ 1 − (0.95)^15 = 0.537 — about a 54% chance of at least one false positive. This is why a post-hoc correction is mandatory.
3. Your study compares three new drugs against one placebo and nothing else. Which post-hoc is most powerful?
Dunnett’s test. It makes only the three treatment-vs-control comparisons rather than all six pairwise comparisons, so it preserves more power than Tukey for this specific design.

FAQs

Common questions about this topic

Do I need a significant ANOVA F before running Tukey HSD?
Traditionally yes: the omnibus F acts as a gatekeeper. In modern practice Tukey HSD controls the family-wise error rate on its own, so some statisticians run it directly. The conservative, widely taught workflow is: significant F first, then Tukey to locate the differences. Follow the convention your course or journal expects.

What is the studentized range distribution?
It models the distribution of the range (largest minus smallest) among k sample means drawn from the same normal population, scaled by the standard error. Because Tukey compares the biggest gap among means, using this distribution, rather than the ordinary t, is what correctly controls the family-wise error across all pairwise comparisons.

Can Tukey HSD handle unequal group sizes?
Yes, via the Tukey-Kramer modification, which adjusts the standard error per pair using √[(MS_within/2)(1/n_i + 1/n_j)]. It is the default in most software when groups are unbalanced and performs well unless sizes are wildly different, in which case Games-Howell (which also drops the equal-variance assumption) is worth considering.

When should I use Bonferroni instead of Tukey?
For a small number of pre-planned comparisons, Bonferroni can be competitive and is simpler. But for the full set of all pairwise comparisons, Bonferroni is more conservative than Tukey and loses power, because it does not exploit the structure of the studentized range. Use Bonferroni for a handful of specific planned contrasts, Tukey for all pairs.

How does StatsIQ help with post-hoc tests?
Snap a photo of the ANOVA output or grouped data; StatsIQ runs the F test and, when significant, builds the Tukey HSD table (q lookup, HSD value, each pairwise verdict, and simultaneous confidence intervals) and recommends Dunnett or Games-Howell when the design or variances call for it. This content is for educational purposes only.
