Advanced · Intermediate · 20-25 min

Friedman Test vs Repeated-Measures ANOVA: When to Use Each

Repeated-measures ANOVA tests means across three or more related conditions; the Friedman test does the same job non-parametrically with ranks. Here is exactly how each works, when ranks beat means, and two fully worked examples.

What You'll Learn

  • Decide between repeated-measures ANOVA and the Friedman test based on the data, not habit.
  • Compute the Friedman statistic by hand with within-subject ranks and ties.
  • Pick the right post-hoc test for either route and report an appropriate effect size.

1. Direct Answer: Which Repeated-Measures Test to Use

Use repeated-measures ANOVA when the same subjects are measured under three or more conditions (or at three or more time points), the residuals are approximately normal, and the sphericity assumption holds, meaning the variances of the differences between every pair of conditions are roughly equal. Use the Friedman test, its non-parametric cousin, when the outcome is ordinal, the residuals are skewed or contain outliers, or the sample is too small to defend normality. Friedman ranks the conditions within each subject, so it ignores raw magnitudes entirely and asks only whether some conditions are ranked consistently higher than others across subjects. The trade-off is the usual one: when normality and sphericity hold, ANOVA has more power and gives a directly interpretable mean difference; when they fail, the F-statistic can be badly distorted while Friedman barely flinches. A Greenhouse-Geisser or Huynh-Feldt correction can rescue ANOVA from sphericity violations but cannot rescue it from heavy outliers or ordinal scales.

Key Points

  • Repeated-measures ANOVA: tests mean differences across 3+ conditions, assumes normal residuals and sphericity.
  • Friedman: tests whether the rank order of conditions differs systematically, works on ordinal data.
  • Both require that the same subjects appear under every condition (a complete within-subjects design).
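
To make the sphericity condition concrete, the sketch below computes the sample variance of the differences for every pair of conditions. The function name and the small score table are hypothetical, made up purely for illustration; roughly equal variances would be consistent with sphericity, while widely different ones would not.

```python
from itertools import combinations
from statistics import variance


def pairwise_difference_variances(data):
    """Return {(i, j): sample variance of (condition i - condition j)}.

    `data` is a list of rows: one row per subject, one value per condition.
    """
    k = len(data[0])
    result = {}
    for i, j in combinations(range(k), 2):
        diffs = [row[i] - row[j] for row in data]
        result[(i, j)] = variance(diffs)  # n - 1 in the denominator
    return result


# Hypothetical scores: 5 subjects x 3 conditions (illustrative only).
scores = [
    [4.0, 5.0, 7.0],
    [3.0, 5.0, 6.0],
    [5.0, 6.0, 9.0],
    [2.0, 3.0, 4.0],
    [6.0, 8.0, 9.0],
]

variances = pairwise_difference_variances(scores)
for pair, var in variances.items():
    print(pair, round(var, 3))
```

This is only an eyeball check, not a formal test; it shows what "equal variances of the pairwise differences" refers to in the data.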

2. How Repeated-Measures ANOVA Works

Repeated-measures ANOVA partitions total variance into (a) variance between subjects, (b) variance between conditions, and (c) residual error. Removing between-subjects variance is what makes the design powerful: each subject is its own control. The test statistic is F = MS_conditions / MS_error with df_conditions = k − 1 and df_error = (n − 1)(k − 1), where k is the number of conditions and n is the number of subjects. Compare F to the critical value of the F distribution with those degrees of freedom. Sphericity matters here: if the variances of the pairwise condition differences are unequal, the uncorrected F-test is too liberal and its p-value is too small. The Greenhouse-Geisser correction compensates by multiplying both degrees of freedom by an estimate ε that lies between 1/(k − 1) and 1. Mauchly’s test screens for sphericity, but its low power makes it a weak gatekeeper; modern practice is to report the Greenhouse-Geisser corrected p-value by default. With k = 2 conditions you are effectively running a paired t-test.

Key Points

  • Removes between-subjects variance, boosting power over an independent-groups design.
  • F = MS_conditions / MS_error with df = k−1 and (n−1)(k−1).
  • Default to Greenhouse-Geisser–corrected p-values unless sphericity is clearly intact.
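
The partition described above can be sketched in plain Python. This is a minimal illustration for a one-way within-subjects design with no missing cells; the function name `rm_anova` and the score table are hypothetical, and no sphericity correction is applied.

```python
def rm_anova(data):
    """One-way repeated-measures ANOVA on rows = subjects, columns = conditions."""
    n = len(data)       # number of subjects
    k = len(data[0])    # number of conditions
    grand = sum(sum(row) for row in data) / (n * k)

    subject_means = [sum(row) / k for row in data]
    condition_means = [sum(row[j] for row in data) / n for j in range(k)]

    # Partition the total sum of squares.
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
    ss_conditions = n * sum((m - grand) ** 2 for m in condition_means)
    ss_error = ss_total - ss_subjects - ss_conditions

    df_conditions = k - 1
    df_error = (n - 1) * (k - 1)
    f_stat = (ss_conditions / df_conditions) / (ss_error / df_error)
    return {
        "ss_subjects": ss_subjects,
        "ss_conditions": ss_conditions,
        "ss_error": ss_error,
        "df": (df_conditions, df_error),
        "F": f_stat,
    }


# Hypothetical scores: 5 subjects x 3 conditions (illustrative only).
scores = [
    [4.0, 5.0, 7.0],
    [3.0, 5.0, 6.0],
    [5.0, 6.0, 9.0],
    [2.0, 3.0, 4.0],
    [6.0, 8.0, 9.0],
]

result = rm_anova(scores)
print(result["df"], round(result["F"], 2))
```

The subject term is subtracted before forming the error term, which is the step that distinguishes this from an independent-groups ANOVA on the same numbers.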

3. How the Friedman Test Works

Step 1: arrange the data so each row is a subject and each column is a condition. Step 2: rank the values within each row from 1 (smallest) to k (largest), assigning average ranks to ties. Step 3: sum the ranks in each column (R_j). Step 4: compute Q = [12 / (nk(k+1))] × Σ R_j² − 3n(k+1). Under the null, Q approximately follows a chi-square distribution with k − 1 degrees of freedom. The approximation is conservative for small n, where exact tables or the Iman-Davenport F-approximation, F = (n − 1)Q / [n(k − 1) − Q] with df = k − 1 and (k − 1)(n − 1), perform better. When ties are present, divide Q by the tie-correction factor 1 − Σ(t³ − t) / [nk(k² − 1)], where t is the size of each group of tied values within a row. If Q exceeds the critical value, reject the null: at least one condition differs in rank from the others. The Friedman test ignores raw magnitudes entirely, which is exactly why it survives outliers and ordinal scales that would torpedo ANOVA.

Key Points

  • Rank within each subject, then sum ranks per condition.
  • Q = [12/(nk(k+1))] × Σ R_j² − 3n(k+1), df = k − 1.
  • Correct Q for ties, and use the Iman-Davenport F-approximation when the chi-square approximation is too conservative (small n).
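
The four steps can be sketched as follows. This is a minimal standard-library illustration, not a reference implementation: the helper names and the rating table are hypothetical, and the Q returned here is the uncorrected statistic (no tie adjustment).

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend `end` over the run of values tied with the one at `pos`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based
        for p in range(pos, end + 1):
            ranks[order[p]] = mean_rank
        pos = end + 1
    return ranks


def friedman_q(data):
    """Return (Q, rank_sums) for rows = subjects, columns = conditions."""
    n, k = len(data), len(data[0])
    rank_rows = [average_ranks(row) for row in data]          # Step 2
    rank_sums = [sum(r[j] for r in rank_rows) for j in range(k)]  # Step 3
    q = 12 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums) - 3 * n * (k + 1)
    return q, rank_sums


# Hypothetical ratings: 4 subjects x 3 conditions, with one tie in row 2.
ratings = [
    [1, 2, 3],
    [2, 2, 5],
    [3, 1, 4],
    [1, 3, 6],
]

q, rank_sums = friedman_q(ratings)
print(rank_sums, q)
```

A quick sanity check on any hand calculation: the rank sums must add up to nk(k+1)/2, here 4 × 3 × 4 / 2 = 24.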

4. Worked Example 1: Three Treatments, Continuous Outcome (RM-ANOVA)

Eight patients receive three migraine treatments A, B, C in a within-subjects crossover. Hours of pain relief per treatment look symmetric with no outliers, so RM-ANOVA fits. SS_subjects = 42.6, SS_conditions = 28.4, SS_total = 96.8, so SS_error = 96.8 − 42.6 − 28.4 = 25.8. df_conditions = 2, df_error = 14, df_subjects = 7. F = (28.4/2) / (25.8/14) = 14.2 / 1.84 = 7.71. The critical value of F at α = 0.05 with df = 2, 14 is 3.74, so reject the null; treatments differ in mean relief. A Bonferroni-corrected paired t-test post-hoc identifies B as superior to both A and C. Partial η² = SS_conditions / (SS_conditions + SS_error) = 28.4 / 54.2 = 0.524 — a large effect.

Key Points

  • Symmetric, outlier-free differences justify RM-ANOVA.
  • F = 7.71 with df = 2, 14 → p < 0.01.
  • Always report partial η² alongside the F-statistic.
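
The arithmetic in this example can be re-checked with a few lines of Python. Only the sums of squares, n, and k come from the example; the variable names are illustrative.

```python
# Values taken from the worked example.
ss_subjects, ss_conditions, ss_total = 42.6, 28.4, 96.8
n, k = 8, 3

ss_error = ss_total - ss_subjects - ss_conditions
df_conditions = k - 1
df_error = (n - 1) * (k - 1)

f_stat = (ss_conditions / df_conditions) / (ss_error / df_error)
partial_eta_sq = ss_conditions / (ss_conditions + ss_error)

print(round(ss_error, 1), df_conditions, df_error)   # 25.8 2 14
print(round(f_stat, 2), round(partial_eta_sq, 3))    # 7.71 0.524
```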

5. Worked Example 2: Ordinal Rater Scores (Friedman)

Six judges rate four wine bottles on a 1–10 ordinal scale. Bottle ranks within each judge (averaging ties): R_1 = 1+2+2+1+1+1 = 8; R_2 = 2+1+1+3+3+2 = 12; R_3 = 4+4+3+4+4+4 = 23; R_4 = 3+3+4+2+2+3 = 17. As a check, the rank sums total 60 = nk(k+1)/2. n = 6, k = 4. Q = [12/(6×4×5)] × (64 + 144 + 529 + 289) − 3×6×5 = (12/120)×1026 − 90 = 102.6 − 90 = 12.6. Critical chi-square at df = 3, α = 0.05 is 7.815, so reject the null: bottles differ in rank. A Nemenyi or Conover post-hoc identifies bottle 3 as ranked significantly higher than bottle 1; because rank 1 goes to the lowest score, bottle 3 is the one the judges scored highest. Kendall’s W = Q / [n(k−1)] = 12.6 / 18 = 0.70 indicates strong agreement among judges.

Key Points

  • Ordinal rating data points to Friedman, not ANOVA.
  • Q = 12.6, df = 3 → p ≈ 0.006.
  • Kendall’s W reports both effect size and rater agreement.
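
The numbers in this example can be reproduced from the rank sums alone. The sketch below uses only the standard library; the helper `chi2_sf_df3` is a hypothetical name for the closed-form upper-tail probability of a chi-square variable, which is valid only for exactly 3 degrees of freedom.

```python
import math

# Rank sums, n, and k taken from the worked example.
rank_sums = [8, 12, 23, 17]
n, k = 6, 4

q = 12 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums) - 3 * n * (k + 1)
kendall_w = q / (n * (k - 1))


def chi2_sf_df3(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_sf_df3(q)
print(round(q, 1), round(kendall_w, 2), round(p_value, 4))  # 12.6 0.7 0.0056
```

For other degrees of freedom a statistics library's chi-square survival function would be the natural replacement for the hand-written helper.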

6. Post-Hocs, Ties, and Effect Sizes

A significant omnibus result tells you SOMETHING differs but not WHICH conditions. For RM-ANOVA use paired t-tests with Bonferroni or Holm correction, or Tukey on the marginal means; report partial η² (small ≈ 0.01, medium ≈ 0.06, large ≈ 0.14) or generalized η² for designs with between-subjects factors. For Friedman use the Nemenyi or Conover-Iman test for all pairwise comparisons, or Dunn’s test versus a control condition; report Kendall’s W as the effect size (0–1, where 1 is perfect agreement). Ties are handled differently: RM-ANOVA does not care about ties in raw values, but ties within a subject shrink the variance of Friedman’s ranks, so Q should be divided by a tie-correction factor (most software does this automatically). The Iman-Davenport F-approximation addresses a separate problem, the conservativeness of the chi-square approximation in small samples. Reporting both the omnibus and the post-hoc with effect size keeps your conclusion about magnitude, not just significance.

Key Points

  • RM-ANOVA: Bonferroni/Holm paired t-tests + partial η² for effect size.
  • Friedman: Nemenyi or Conover-Iman post-hoc + Kendall’s W.
  • Apply the tie correction when ties are frequent in the Friedman setup; use Iman-Davenport when the chi-square approximation is too conservative.
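
A minimal sketch of the two Friedman adjustments, written as separate helpers. It assumes the textbook forms: the Iman-Davenport statistic F = (n − 1)Q / [n(k − 1) − Q] with df = (k − 1, (k − 1)(n − 1)), and the tie-correction factor C = 1 − Σ(t³ − t) / [nk(k² − 1)], where t is the size of each tied group within a subject. The function names and the small rating table are hypothetical.

```python
from collections import Counter


def iman_davenport_f(q, n, k):
    """Convert a Friedman Q into the Iman-Davenport F statistic and its df."""
    f_stat = (n - 1) * q / (n * (k - 1) - q)
    return f_stat, (k - 1, (k - 1) * (n - 1))


def tie_correction_factor(data):
    """Return C; the tie-corrected Friedman statistic is Q / C."""
    n, k = len(data), len(data[0])
    tie_term = 0
    for row in data:
        # Each group of t tied values within a subject contributes t^3 - t.
        tie_term += sum(t ** 3 - t for t in Counter(row).values())
    return 1 - tie_term / (n * k * (k ** 2 - 1))


# Q = 12.6 with n = 6 judges and k = 4 bottles, as in Worked Example 2.
f_stat, df = iman_davenport_f(12.6, n=6, k=4)
print(round(f_stat, 2), df)

# Hypothetical ratings with one tie (row 2), for the tie factor only.
ratings = [
    [1, 2, 3],
    [2, 2, 5],
    [3, 1, 4],
    [1, 3, 6],
]
c = tie_correction_factor(ratings)
print(c)
```

With no ties, every t equals 1, so C = 1 and Q is unchanged.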

7. Running the Comparison in StatsIQ

Snap a photo of a within-subjects design and StatsIQ checks the residuals for normality and screens sphericity with Mauchly’s test plus the Greenhouse-Geisser epsilon, recommends RM-ANOVA or Friedman accordingly, then runs the chosen procedure with the full rank-or-SS partition shown step by step. It then offers the matching post-hoc (Bonferroni paired t-tests for ANOVA, Nemenyi for Friedman) with the right effect size. This content is for educational purposes only.

Key Points

  • Automatic sphericity screen with Greenhouse-Geisser epsilon.
  • Rank or SS partition table shown for both routes.
  • Matching post-hoc test plus effect size reported alongside the omnibus.

Key Takeaways

  • RM-ANOVA assumes sphericity (equal variances of pairwise condition differences); default to Greenhouse-Geisser corrected p-values.
  • Friedman ranks within each subject; ignores magnitudes; survives outliers and ordinal scales.
  • Q = [12/(nk(k+1))] × Σ R_j² − 3n(k+1), df = k − 1.
  • Under normality, Friedman’s asymptotic efficiency relative to RM-ANOVA is (3/π) × k/(k+1): about 72% at k = 3 and 80% at k = 5, approaching 95% only with many conditions.
  • Effect sizes: partial η² for RM-ANOVA; Kendall’s W for Friedman.

Practice Questions

1. Five subjects are measured under k = 4 conditions, the data are continuous, residuals look normal, but Mauchly’s p = 0.02. Which approach?
Use repeated-measures ANOVA with Greenhouse-Geisser correction to the degrees of freedom. The continuous, normal residuals support ANOVA; the sphericity violation is exactly what the correction is designed to handle. Friedman would be defensible but throws away information.
2. Why is partial η² not interchangeable with η² in RM-ANOVA reporting?
Partial η² removes the between-subjects variance from the denominator, so its value is inflated relative to generalized η² or classical η². For comparing effects across studies or across between- and within-subject designs, generalized η² is preferred because it scales effects to the total observed variance.
3. Friedman is significant for k = 4 conditions. Which post-hoc would you run, and why?
Nemenyi (for all pairwise comparisons) or Conover-Iman (more powerful, but valid only as a follow-up to a significant Friedman test); Dunn’s test if comparing each condition to a control. All three respect the rank-based structure that drove the omnibus, unlike Tukey, which would assume a normal sampling distribution.

FAQs

Common questions about this topic

Technically yes — with k = 2 it reduces to the sign test on paired data, not the Wilcoxon signed-rank test. For two paired conditions the Wilcoxon signed-rank test is the better non-parametric choice because it incorporates the magnitudes of differences. Reserve Friedman for k ≥ 3.
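
The reduction to the sign test can be verified algebraically. As a sketch, assume no ties and let a be the number of subjects whose first condition has the smaller value (the symbol a is introduced here only for this derivation). Each such subject contributes ranks (1, 2), and every other subject contributes (2, 1), so with k = 2:

```latex
\begin{aligned}
R_1 &= 2n - a, \qquad R_2 = n + a, \\
Q &= \frac{12}{n \cdot 2 \cdot 3}\left(R_1^2 + R_2^2\right) - 3n \cdot 3
   = \frac{2}{n}\left(5n^2 - 2na + 2a^2\right) - 9n \\
  &= \frac{(n - 2a)^2}{n}
   = \frac{\bigl(a - (n - a)\bigr)^2}{n}.
\end{aligned}
```

The last expression is the square of the normal-approximation sign-test statistic z = (2a − n)/√n, so Q depends only on how many subjects favour each condition, not on how large the differences are.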

Sphericity is the within-subjects analogue of homogeneity of variance: the variances of every pairwise difference between conditions should be roughly equal. It matters when k ≥ 3 because unequal pairwise-difference variances inflate F and shrink the p-value. With k = 2 there is only one pairwise difference and sphericity is automatically satisfied.
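
One way to quantify departure from sphericity is the Greenhouse-Geisser epsilon. The sketch below assumes the usual estimate based on the double-centred sample covariance matrix of the conditions, ε = (trace)² / [(k − 1) × sum of squared entries]; the function name and the score table are hypothetical. It also illustrates the k = 2 case, where ε comes out as 1.

```python
def greenhouse_geisser_epsilon(data):
    """Estimate epsilon for rows = subjects, columns = conditions."""
    n, k = len(data), len(data[0])
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    # Sample covariance matrix of the k conditions.
    cov = [
        [
            sum((row[i] - col_means[i]) * (row[j] - col_means[j]) for row in data)
            / (n - 1)
            for j in range(k)
        ]
        for i in range(k)
    ]

    # Double-centre: subtract row and column means, add back the grand mean.
    row_means = [sum(cov[i]) / k for i in range(k)]
    grand = sum(row_means) / k
    centred = [
        [cov[i][j] - row_means[i] - row_means[j] + grand for j in range(k)]
        for i in range(k)
    ]

    trace = sum(centred[i][i] for i in range(k))
    sum_sq = sum(v ** 2 for row in centred for v in row)
    return trace ** 2 / ((k - 1) * sum_sq)


# Hypothetical scores: 5 subjects x 3 conditions (illustrative only).
scores = [
    [4.0, 5.0, 7.0],
    [3.0, 5.0, 6.0],
    [5.0, 6.0, 9.0],
    [2.0, 3.0, 4.0],
    [6.0, 8.0, 9.0],
]

eps_three = greenhouse_geisser_epsilon(scores)
eps_two = greenhouse_geisser_epsilon([row[:2] for row in scores])
print(round(eps_three, 3), round(eps_two, 3))
```

The estimate lies between 1/(k − 1) and 1; values near 1 are consistent with sphericity, and the corrected test multiplies both degrees of freedom by ε.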

Yes — by ranking within subjects it discards how far apart the values actually were. When the continuous data are normal and well behaved, RM-ANOVA recovers that magnitude information and produces a tighter test. When normality fails or scales are ordinal, that lost information was unreliable anyway and ranks are the conservative choice.

W = Q / [n(k − 1)] from the Friedman statistic, where n is subjects (or judges) and k is conditions. W ranges from 0 to 1: 0.1 is weak agreement, 0.3 moderate, 0.5 strong, and above 0.7 is very strong. W > 0.7 is unusual outside of expert-judging contexts; in psychology and education 0.3–0.5 is more common.

Yes. Photograph a within-subjects dataset or problem and StatsIQ checks residual normality, runs Mauchly’s test plus the Greenhouse-Geisser epsilon for sphericity, then recommends RM-ANOVA or Friedman accordingly and runs the full procedure with the right post-hoc test attached. This content is for educational purposes only.
