Non-Parametric Tests: When to Use Mann-Whitney, Wilcoxon, and Kruskal-Wallis
A practical guide to the three most common non-parametric tests covering when parametric assumptions fail, how rank-based tests work, and step-by-step procedures for Mann-Whitney U (two independent groups), Wilcoxon signed-rank (paired data), and Kruskal-Wallis (three or more groups).
What You'll Learn
- Identify when non-parametric tests are appropriate instead of parametric alternatives
- Perform and interpret a Mann-Whitney U test for comparing two independent groups
- Perform and interpret a Wilcoxon signed-rank test for paired or repeated-measures data
- Perform and interpret a Kruskal-Wallis test for comparing three or more independent groups
1. When to Go Non-Parametric
Parametric tests (t-tests, ANOVA) assume that your data comes from a normal distribution with equal variances. When those assumptions hold, parametric tests are more powerful: they are better at detecting real effects. But when the assumptions are violated, parametric results can be misleading or outright wrong.

Use non-parametric tests when:

- Your data is clearly non-normal (heavily skewed, bimodal, or with extreme outliers) and the sample size is too small for the Central Limit Theorem to rescue you (roughly n < 30 per group)
- Your data is ordinal, such as survey responses on a 1-5 scale: you can say 4 is higher than 3, but not that the difference between 3 and 4 equals the difference between 4 and 5
- Your sample size is very small (under 15 per group)
- Your data has outliers that dramatically influence the mean but not the median

The trade-off is straightforward. Non-parametric tests make fewer assumptions, which makes them more broadly applicable, but they have less statistical power: roughly 95% of the power of their parametric equivalents when the data actually is normal. In practice, this means you need a slightly larger sample to detect the same effect. For most applications, the power difference is negligible. The bigger risk is using a parametric test when assumptions are grossly violated and getting a misleading result.
Key Points
- Non-parametric tests do not assume normality: use them when data is skewed, ordinal, or has extreme outliers
- They have ~95% of the power of parametric tests under normal conditions, so the power loss is usually negligible
- Small samples (n < 15-30) cannot rely on the CLT to normalize sampling distributions, so non-parametric is safer
- The mapping: t-test → Mann-Whitney (independent) or Wilcoxon signed-rank (paired); ANOVA → Kruskal-Wallis
2. Mann-Whitney U Test: Two Independent Groups
The Mann-Whitney U test (also called the Wilcoxon rank-sum test, confusingly similar in name to the paired Wilcoxon test) compares two independent groups when you cannot assume normality. It is the non-parametric equivalent of the independent-samples t-test. The test works by ranking all observations from both groups together, then checking whether the ranks are distributed evenly between the groups. If one group has systematically higher ranks, the groups differ.

Procedure:

1. Combine all observations from both groups and rank them from smallest to largest. Assign average ranks for ties.
2. Sum the ranks for each group separately (R1 and R2).
3. Calculate U1 = n1*n2 + n1(n1+1)/2 - R1 and U2 = n1*n2 + n2(n2+1)/2 - R2.
4. The test statistic is the smaller of U1 and U2 (some software reports the larger; check your software's documentation).
5. Compare to the critical value in a Mann-Whitney table, or use the p-value from software.

Worked example: Group A test scores: 15, 22, 18, 25. Group B: 30, 28, 35, 32. Combined and ranked: 15(1), 18(2), 22(3), 25(4), 28(5), 30(6), 32(7), 35(8). R_A = 1+2+3+4 = 10. R_B = 5+6+7+8 = 26. U_A = 4*4 + 4*5/2 - 10 = 16 + 10 - 10 = 16. U_B = 4*4 + 4*5/2 - 26 = 0. U = min(16, 0) = 0. With n1 = n2 = 4, a U of 0 has p < 0.05 (check a Mann-Whitney table). The groups differ significantly; Group B scored higher.

Interpretation: Mann-Whitney tests whether the distributions of the two groups differ, not specifically whether the means differ. When the two distributions have similar shapes, it effectively tests whether one group tends to have higher values than the other. StatsIQ generates Mann-Whitney practice problems with automatic ranking, U calculation, and interpretation guidance.
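The ranking and U calculation above can be sketched in plain Python using only the standard library. The function names are illustrative (not part of any library), and the data is the worked example's:

```python
def avg_ranks(values):
    """Rank values smallest to largest, giving tied values their average rank."""
    s = sorted(values)
    first, count = {}, {}
    for pos, v in enumerate(s):
        first.setdefault(v, pos)  # first 0-based position of v in sorted order
        count[v] = count.get(v, 0) + 1
    # a value occupying sorted positions first..first+count-1 (0-based)
    # receives the average of ranks first+1 .. first+count
    return [first[v] + (count[v] + 1) / 2 for v in values]

def mann_whitney_u(group_a, group_b):
    """Steps (1)-(4): rank jointly, sum ranks per group, return the smaller U."""
    n1, n2 = len(group_a), len(group_b)
    r = avg_ranks(list(group_a) + list(group_b))
    r1, r2 = sum(r[:n1]), sum(r[n1:])
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return min(u1, u2)

print(mann_whitney_u([15, 22, 18, 25], [30, 28, 35, 32]))  # 0.0, matching the hand calculation
```

The sketch stops at the U statistic; for a p-value you would still consult a Mann-Whitney table or statistical software.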
Key Points
- Mann-Whitney = non-parametric equivalent of the independent t-test; compares two unrelated groups
- The test ranks all observations together, then checks whether ranks are evenly distributed between groups
- Null hypothesis: the two groups come from the same distribution (no difference)
- Works with ordinal data, skewed data, and small samples where t-test assumptions fail
3. Wilcoxon Signed-Rank Test: Paired or Repeated Data
The Wilcoxon signed-rank test is the non-parametric equivalent of the paired t-test. It compares two related measurements: before/after scores, matched pairs, or repeated measures on the same subjects. The test works by calculating the difference for each pair, ranking the absolute differences, and then comparing the sum of positive ranks to the sum of negative ranks. If the treatment has no effect, the positive and negative rank sums should be roughly equal.

Procedure:

1. Calculate the difference (d = after - before) for each pair.
2. Discard any pairs where d = 0.
3. Rank the absolute values of the remaining differences, smallest to largest.
4. Assign each rank the sign of its difference (positive or negative).
5. Sum the positive ranks (W+) and the negative ranks (W-).
6. The test statistic W is the smaller of W+ and W-.
7. Compare to a critical value table or use software.

Example: Five patients measured before and after treatment. Differences: +5, -2, +8, +3, +6. Absolute values ranked: 2(1), 3(2), 5(3), 6(4), 8(5). Signed ranks: -1, +2, +3, +4, +5. W+ = 14, W- = 1. W = min(14, 1) = 1. For n = 5, the exact one-tailed p-value for W = 1 is 0.0625, just above 0.05: the improvement is consistent, but five pairs provide too little power to reach significance. This illustrates the cost of very small samples; a few more pairs showing the same pattern would push the result below 0.05.

When to use Wilcoxon over the paired t-test: when the differences are clearly non-normal (skewed, with outliers), when n is small (under 20 pairs), or when the data is ordinal (patient-rated improvement on a 1-10 scale). For large samples with roughly normal differences, the paired t-test is slightly more powerful and gives equivalent results.
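The signed-rank steps above can be sketched in plain Python (standard library only). The function names are illustrative, and the input is the list of paired differences from the example:

```python
def avg_ranks(values):
    """Rank values smallest to largest, giving tied values their average rank."""
    s = sorted(values)
    first, count = {}, {}
    for pos, v in enumerate(s):
        first.setdefault(v, pos)
        count[v] = count.get(v, 0) + 1
    return [first[v] + (count[v] + 1) / 2 for v in values]

def wilcoxon_w(diffs):
    """Steps (2)-(6): drop zeros, rank |d|, sign the ranks, return the smaller sum."""
    diffs = [d for d in diffs if d != 0]            # step (2): discard zero differences
    r = avg_ranks([abs(d) for d in diffs])          # step (3): rank absolute differences
    w_plus = sum(rk for d, rk in zip(diffs, r) if d > 0)   # step (5)
    w_minus = sum(rk for d, rk in zip(diffs, r) if d < 0)
    return min(w_plus, w_minus)                     # step (6)

print(wilcoxon_w([5, -2, 8, 3, 6]))  # 1.0, matching the hand calculation
```

As with Mann-Whitney, the sketch produces only the test statistic; the p-value still comes from an exact table or software.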
Key Points
- Wilcoxon signed-rank = non-parametric equivalent of the paired t-test; compares two related measurements
- Ranks the absolute differences, then compares the sum of positive ranks vs negative ranks
- Use when paired differences are non-normal, ordinal, or the sample is very small (< 20 pairs)
- Discards zero differences; only non-zero differences contribute to the test statistic
4. Kruskal-Wallis Test: Three or More Groups
The Kruskal-Wallis test is the non-parametric equivalent of one-way ANOVA. It compares three or more independent groups when the normality assumption is violated. The logic extends the Mann-Whitney approach: rank all observations across all groups together, then check whether the rank distributions differ between groups. If one or more groups have systematically higher or lower ranks, the test detects the difference.

The test statistic H approximately follows a chi-square distribution with k-1 degrees of freedom (where k is the number of groups): H = [12 / N(N+1)] * sum(R_i^2 / n_i) - 3(N+1), where N is the total sample size, R_i is the sum of ranks in group i, and n_i is the size of group i. The chi-square approximation is reliable when each group has at least about 5 observations; for smaller groups, use exact Kruskal-Wallis tables. If H exceeds the chi-square critical value at your chosen alpha, at least one group differs from the others. But like ANOVA, a significant Kruskal-Wallis result does not tell you which groups differ; you need post-hoc pairwise comparisons (typically Dunn's test with a Bonferroni or Holm correction) to identify the specific group differences.

When to use Kruskal-Wallis instead of ANOVA: when data is ordinal (e.g., pain ratings across three treatment groups), when distributions are clearly non-normal across groups, or when sample sizes are small and unequal. For large samples with approximately normal distributions, one-way ANOVA is more powerful and gives equivalent conclusions. StatsIQ generates Kruskal-Wallis practice problems including ranking across groups, H calculation, and Dunn's post-hoc comparison interpretation.
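The H calculation above can be sketched in plain Python (standard library only). The function names are illustrative, and the three sample groups are made-up data for demonstration, not from the text:

```python
def avg_ranks(values):
    """Rank values smallest to largest, giving tied values their average rank."""
    s = sorted(values)
    first, count = {}, {}
    for pos, v in enumerate(s):
        first.setdefault(v, pos)
        count[v] = count.get(v, 0) + 1
    return [first[v] + (count[v] + 1) / 2 for v in values]

def kruskal_h(groups):
    """H = [12 / N(N+1)] * sum(R_i^2 / n_i) - 3(N+1), as in the formula above."""
    combined = [x for g in groups for x in g]
    r = avg_ranks(combined)          # rank all observations across all groups
    n_total = len(combined)
    rank_sum_term = 0.0
    start = 0
    for g in groups:
        r_i = sum(r[start:start + len(g)])   # R_i: sum of ranks in group i
        rank_sum_term += r_i ** 2 / len(g)   # R_i^2 / n_i
        start += len(g)
    return 12 / (n_total * (n_total + 1)) * rank_sum_term - 3 * (n_total + 1)

# three hypothetical groups of 4 observations each (k = 3, so df = 2)
groups = [[15, 22, 18, 25], [30, 28, 35, 32], [40, 38, 45, 42]]
h = kruskal_h(groups)
print(round(h, 2))  # 9.85 > 5.991 (chi-square critical value, df = 2, alpha = 0.05)
```

Since H exceeds the critical value, at least one of the three hypothetical groups differs; Dunn's post-hoc comparisons would identify which.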
Key Points
- Kruskal-Wallis = non-parametric equivalent of one-way ANOVA; compares 3+ independent groups
- The H statistic follows a chi-square distribution with k-1 degrees of freedom
- A significant H means at least one group differs; use Dunn's test for pairwise post-hoc comparisons
- Best for ordinal data, non-normal distributions, or small unequal group sizes
Key Takeaways
- Non-parametric tests rank data instead of using raw values, which makes them robust to outliers and non-normality
- t-test equivalent: Mann-Whitney (independent) or Wilcoxon signed-rank (paired); ANOVA equivalent: Kruskal-Wallis
- Non-parametric tests have ~95% of the power of parametric equivalents under normal conditions, a minimal practical loss
- Kruskal-Wallis requires post-hoc tests (Dunn's) to identify which specific groups differ, following the same logic as ANOVA
- For ordinal data (survey scales, ratings), non-parametric tests are generally more appropriate than parametric
Practice Questions
1. You have pain ratings (1-10 scale) from two treatment groups, each with 12 patients. The data is ordinal and right-skewed. Which test should you use?
2. Three different fertilizers are tested on 8 plants each. Growth measurements are normally distributed. Should you use Kruskal-Wallis or one-way ANOVA?
FAQs
Common questions about this topic
Why not just always use non-parametric tests to be safe?
You can, and the results will be valid. But you sacrifice about 5% statistical power compared to parametric tests when the parametric assumptions actually hold. For small samples where detecting a real effect is already hard, that 5% power loss could mean the difference between significant and non-significant. When assumptions are met, use parametric. When they are violated, use non-parametric. When you are unsure, run both and compare: if they agree, the conclusion is robust.
Does StatsIQ generate practice problems for non-parametric tests?
Yes. StatsIQ generates non-parametric test problems including Mann-Whitney U, Wilcoxon signed-rank, and Kruskal-Wallis with step-by-step ranking procedures, test statistic calculation, and result interpretation.