Fundamentals · Intermediate · 25-35 min

Mann-Whitney U Test vs t-Test: When to Use Which (Worked Examples)

A practical comparison of the Mann-Whitney U test (Wilcoxon rank-sum) and the independent-samples t-test: when normality assumptions justify the t-test, when ranks are the safer choice, and how to compute and interpret both with two worked examples.

What You'll Learn

  • Decide between the Mann-Whitney U test and the independent-samples t-test based on data structure.
  • Compute the U statistic from ranks and interpret the resulting p-value.
  • Quantify the power cost of using ranks when the t-test's assumptions actually hold.

1. Direct Answer: When to Use Mann-Whitney Instead of a t-Test

Use the Mann-Whitney U test (also called the Wilcoxon rank-sum test) when comparing two independent groups and you cannot safely assume normality, typically because the sample size is small, the data are ordinal (Likert ratings, ranks), or there are clear outliers that distort the mean. The t-test assumes the sampling distribution of the difference in means is approximately normal. By the central limit theorem this holds reasonably well for continuous data with finite variance once samples are large (a common rule of thumb is n ≥ 30 per group, though heavily skewed data may need more). When samples are small AND the population distribution is heavily skewed or contains outliers, the t-test's p-value can be misleading; Mann-Whitney works on ranks and is robust to those issues. The price: when the t-test's assumptions actually hold, Mann-Whitney is roughly 95% as efficient (it needs about 5% more observations for the same power), so reach for it only when you have a real reason to doubt normality.

Key Points

  • Mann-Whitney is preferred for small samples with non-normal distributions or ordinal data.
  • The t-test wins on power when its normality assumption is reasonable.
  • For large samples (n ≥ 30 per group), the central limit theorem makes the t-test robust to non-normality.

2. How the U Statistic Is Computed

Combine both groups into one list and rank from smallest to largest. Assign tied values the average of the ranks they span. Sum the ranks for group 1 and call it R1. Compute U1 = R1 - n1(n1+1)/2, where n1 is the size of group 1. Compute U2 the same way for group 2; U1 + U2 = n1 × n2 always, so U2 = n1 × n2 - U1. Use the smaller of U1 and U2 as the test statistic U. For small samples, look it up in a Mann-Whitney table; for n1, n2 ≥ 10, use the normal approximation z = (U - n1×n2/2) / sqrt(n1×n2×(n1+n2+1)/12). The intuition: if the groups are exchangeable, the ranks shuffle randomly between them and U falls near the middle, n1×n2/2. If group 1 is systematically larger, the high ranks pile up in group 1 and U deviates from the middle.

Key Points

  • Combine groups, rank everything, sum the ranks for one group (R1).
  • U1 = R1 - n1(n1+1)/2. Use the smaller of U1 and U2.
  • For n1, n2 ≥ 10, normal approximation: z = (U - n1×n2/2) / sqrt(n1×n2×(n1+n2+1)/12).
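
The steps above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a replacement for a statistics package: the function names `average_ranks` and `mann_whitney_u` are made up for this example, and the p-value uses the plain normal approximation with no tie or continuity correction.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + 1 + j + 1) / 2  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(group1, group2):
    """Return (U1, U2, z, two-sided p) from the uncorrected normal approximation."""
    n1, n2 = len(group1), len(group2)
    ranks = average_ranks(list(group1) + list(group2))
    r1 = sum(ranks[:n1])              # rank sum of group 1
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1                 # because U1 + U2 = n1 * n2
    u = min(u1, u2)
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = 2 * NormalDist().cdf(z)       # z <= 0 because u is the smaller statistic
    return u1, u2, z, p


# Toy data (hypothetical): every value in the first group is below the second.
u1, u2, z, p = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u1, u2)  # 0.0 9.0
```

With groups this small the normal-approximation p-value is only indicative; an exact table or exact test is the better tool.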

3. Worked Example 1: Reaction Times Under Two Conditions

Eight subjects per condition. Reaction times (ms) for condition A: 245, 252, 261, 273, 287, 295, 312, 480. Condition B: 268, 271, 274, 282, 291, 298, 305, 318. Notice the outlier in A (480) that would distort the mean. Combine and rank from smallest: A245=1, A252=2, A261=3, B268=4, B271=5, A273=6, B274=7, B282=8, A287=9, B291=10, A295=11, B298=12, B305=13, A312=14, B318=15, A480=16. R(A) = 1+2+3+6+9+11+14+16 = 62. U(A) = 62 - 8×9/2 = 62 - 36 = 26. U(B) = 8×8 - 26 = 38. Smaller U = 26. With n = 8 per group an exact table is the better tool (the two-tailed critical value at alpha = 0.05 is 13, and 26 > 13), but the normal approximation gives the same picture: z = (26 - 32) / sqrt(64×17/12) = -6 / sqrt(90.67) = -6 / 9.52 = -0.63. Two-tailed p ≈ 0.53. Fail to reject: no evidence the conditions differ. A t-test on these same data also fails to reject (t ≈ 0.44, p roughly 0.67), but its summary statistics are dominated by the single 480 ms value: the outlier pulls the mean of A up to about 300.6 ms, above B's 288.4 ms, even though A's median (280) is below B's (286.5), and it inflates A's standard deviation to about 76 ms versus about 18 ms for B. The ranks tell a more stable story, which is exactly the situation Mann-Whitney was designed for.

Key Points

  • Outliers distort the mean and standard deviation the t-test relies on; Mann-Whitney is far less sensitive because it uses ranks.
  • The rank-sum approach handles the asymmetry created by extreme values gracefully.
  • Report both means and medians when an outlier could swing your conclusion.
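
As a check on the hand calculation, the reaction-time example can be reproduced with the Python standard library. This is a sketch under two stated simplifications: the data have no ties, so a value's rank is simply its sorted position, and the t-test side is limited to the Welch t statistic because the standard library has no t distribution for a p-value. All variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, median, variance

a = [245, 252, 261, 273, 287, 295, 312, 480]
b = [268, 271, 274, 282, 291, 298, 305, 318]
n1, n2 = len(a), len(b)

# No ties in this data set, so rank = 1-based position in the sorted pool.
combined = sorted(a + b)
rank_of = {value: position for position, value in enumerate(combined, start=1)}

r_a = sum(rank_of[x] for x in a)          # rank sum for condition A
u_a = r_a - n1 * (n1 + 1) / 2
u_b = n1 * n2 - u_a
u = min(u_a, u_b)
z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
p = 2 * NormalDist().cdf(z)               # two-sided, normal approximation

# Welch t statistic on the raw values, for comparison with the rank-based result.
t_welch = (mean(a) - mean(b)) / sqrt(variance(a) / n1 + variance(b) / n2)

print(f"R(A)={r_a}, U(A)={u_a}, U(B)={u_b}, z={z:.2f}, p={p:.2f}")
print(f"mean A={mean(a):.1f}, mean B={mean(b):.1f}, "
      f"median A={median(a)}, median B={median(b)}")
print(f"Welch t={t_welch:.2f}")
```

Comparing the means with the medians in the second line of output shows how much a single extreme value moves the mean-based summary.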

4. Worked Example 2: Likert Ratings (1-7) Between Two Groups

You compare satisfaction ratings (1-7 Likert) for two product designs. Group A (n=10): 5, 6, 4, 7, 5, 6, 5, 6, 7, 6. Group B (n=10): 3, 4, 5, 4, 3, 5, 4, 5, 4, 5. Likert data is ordinal (the distance between "5" and "6" is not guaranteed to equal the distance between "3" and "4"), so a t-test on the raw scores assumes more than the measurement actually justifies. Combine and rank with ties getting average ranks: the two 3s get 1.5, the five 4s get 5, the seven 5s get 11, the four 6s get 16.5, and the two 7s get 19.5. Summing gives R(A) = 5 + 3×11 + 4×16.5 + 2×19.5 = 143 and R(B) = 2×1.5 + 4×5 + 4×11 = 67. U(A) = 143 - 55 = 88. U(B) = 100 - 88 = 12. Smaller U = 12. z = (12 - 50) / sqrt(100 × 21 / 12) = -38 / sqrt(175) = -38 / 13.23 = -2.87. Two-tailed p ≈ 0.004. Reject: group A rates the product significantly higher. (With this many ties, software applies a tie correction to the variance, which gives z ≈ -2.97; the conclusion is unchanged.) The interpretation is in terms of stochastic dominance: a random observation from group A is more likely to be higher than a random observation from group B, not "the means differ by X units."

Key Points

  • Likert data is ordinal; Mann-Whitney respects the measurement scale.
  • Conclusion is about stochastic dominance, not a mean difference.
  • Tied observations get the average of the ranks they would have occupied.
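
Tied ranks are easy to get wrong by hand, so it can help to recompute the Likert example programmatically. The sketch below uses only the standard library and reports z both without and with the usual tie correction to the variance of U; the variable names are illustrative, and no continuity correction is applied (some software applies one by default, so its z may differ slightly).

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist

a = [5, 6, 4, 7, 5, 6, 5, 6, 7, 6]
b = [3, 4, 5, 4, 3, 5, 4, 5, 4, 5]
n1, n2 = len(a), len(b)
n = n1 + n2

# Average rank per distinct value: tied values share the mean of the ranks they span.
counts = Counter(a + b)
avg_rank = {}
next_rank = 1
for value in sorted(counts):
    t = counts[value]                       # number of observations tied at this value
    avg_rank[value] = next_rank + (t - 1) / 2
    next_rank += t

r_a = sum(avg_rank[x] for x in a)
u_a = r_a - n1 * (n1 + 1) / 2
u = min(u_a, n1 * n2 - u_a)

# Variance of U without ties, then with the standard tie correction.
var_plain = n1 * n2 * (n + 1) / 12
tie_term = sum(t**3 - t for t in counts.values())
var_ties = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))

for label, var in [("no tie correction", var_plain), ("tie-corrected", var_ties)]:
    z = (u - n1 * n2 / 2) / sqrt(var)
    p = 2 * NormalDist().cdf(z)
    print(f"{label}: R(A)={r_a}, U={u}, z={z:.2f}, p={p:.4f}")
```

The tie correction shrinks the variance, so the corrected z is slightly larger in magnitude than the uncorrected one.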

5. Power Trade-off: The Cost of Using Ranks

When the t-test's assumptions actually hold (normal populations, similar variances), Mann-Whitney has an asymptotic relative efficiency of 0.955, meaning you need about 5% more observations to achieve the same power as the t-test. That cost is small, and many practitioners use Mann-Whitney as the default for two-group comparisons in fields where data are reliably non-normal (clinical trials with skewed lab values, ecology counts, satisfaction surveys). When distributions are heavy-tailed or contain outliers, Mann-Whitney can be substantially MORE powerful than the t-test because the t-test's power is throttled by the outliers inflating the standard deviation. Here is the surprising part most courses leave out: against truly contaminated or fat-tailed distributions, the relative efficiency of Mann-Whitney can reach 1.5 to 3 or more (1.5 for the double-exponential distribution, 3 for the exponential), so the t-test would need that many times as many observations to match it. It is not always the conservative choice.

Key Points

  • Mann-Whitney has 95.5% relative efficiency vs the t-test under normality.
  • Against heavy-tailed or contaminated data, Mann-Whitney can be MORE powerful.
  • Default to Mann-Whitney when distributional assumptions are doubtful.
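
To make the efficiency figure concrete, here is a small sketch that converts a t-test sample size into the roughly equivalent Mann-Whitney sample size. It assumes the asymptotic relative efficiency can be applied directly to finite samples, which is only an approximation; the function name `mann_whitney_n` is invented for this example. The value 0.955 is the rounded form of 3/π, the efficiency under normal populations.

```python
from math import ceil, pi

ARE_NORMAL = 3 / pi  # about 0.955: efficiency of Mann-Whitney vs t-test under normality


def mann_whitney_n(n_t_test, are=ARE_NORMAL):
    """Approximate per-group n for Mann-Whitney to match a t-test using n_t_test."""
    return ceil(n_t_test / are)


print(mann_whitney_n(100))           # 105: about 5% more observations under normality
print(mann_whitney_n(100, are=1.5))  # 67: fewer observations when ranks are more efficient
```

An efficiency above 1 flips the direction: the rank test then needs fewer observations than the t-test for the same power.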

6. Common Decision Tree

Two independent groups, continuous or ordinal outcome.
(1) Is the outcome ordinal (Likert, ranks, ordered categories)? → Mann-Whitney.
(2) Is n ≥ 30 per group and the data continuous? → t-test is robust to non-normality by CLT.
(3) Is n < 30 per group? Check normality with a histogram or Shapiro-Wilk test. If normality is plausible, use the t-test. If not, Mann-Whitney.
(4) Are there outliers? Decide whether they are data-entry errors (drop and use t-test) or genuine observations (use Mann-Whitney to keep them without distortion).
(5) Are variances roughly equal? If not, Welch's t-test (or Mann-Whitney) handles unequal variances better than the pooled t-test.

Key Points

  • Ordinal outcome → Mann-Whitney.
  • Large sample continuous outcome → t-test (CLT to the rescue).
  • Outliers you cannot defensibly drop → Mann-Whitney.
  • Unequal variances → Welch's t-test or Mann-Whitney.
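
The decision tree can be written down as a small function. This is one possible reading of the steps, not a definitive rule: the function name, parameter names, and the choice to treat genuine outliers as decisive only for small samples are assumptions made for illustration, and real decisions should also weigh the research question.

```python
def choose_test(ordinal, n_per_group, normality_plausible=True,
                genuine_outliers=False, equal_variances=True):
    """Suggest a two-group test following the decision steps in this section."""
    # Step 1: ordinal outcomes go straight to the rank-based test.
    if ordinal:
        return "Mann-Whitney U"

    # Step 5: within the t-test family, unequal variances call for Welch's version.
    t_family = "pooled t-test" if equal_variances else "Welch's t-test"

    # Step 2: large continuous samples lean on the central limit theorem.
    if n_per_group >= 30:
        return t_family

    # Steps 3 and 4: small samples with doubtful normality or genuine outliers.
    if genuine_outliers or not normality_plausible:
        return "Mann-Whitney U"
    return t_family


print(choose_test(ordinal=True, n_per_group=10))                           # Mann-Whitney U
print(choose_test(ordinal=False, n_per_group=8, genuine_outliers=True))    # Mann-Whitney U
print(choose_test(ordinal=False, n_per_group=40, equal_variances=False))   # Welch's t-test
```

Outliers that turn out to be data-entry errors are assumed to have been corrected or dropped before calling the function, so `genuine_outliers` refers only to real observations.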

7. Running Either Test in StatsIQ

Snap a photo of the two columns of data and StatsIQ runs both tests automatically, presenting the t-test result, Welch's t-test result, and Mann-Whitney U result side by side along with normality diagnostics (Shapiro-Wilk, Q-Q plot, histogram) so you can defend your choice. The app calls out outliers, suggests transformations, and explains which test is most defensible given your sample size and data shape.

Key Points

  • StatsIQ runs t-test, Welch's t-test, and Mann-Whitney in parallel for comparison.
  • Normality diagnostics are produced automatically.
  • Outlier flags and transformation suggestions appear before the p-value.

Key Takeaways

  • Mann-Whitney = Wilcoxon rank-sum: nonparametric counterpart to the independent-samples t-test.
  • Compares two independent groups using ranks instead of raw values.
  • Asymptotic relative efficiency vs t-test under normality: 0.955.
  • Robust to outliers and applicable to ordinal data.
  • Exact U tables for small samples; normal approximation for n ≥ 10 per group.

Practice Questions

1. Group A (n=5): 12, 14, 15, 18, 22. Group B (n=5): 16, 19, 21, 24, 27. Compute Mann-Whitney U.
Combined ranks: A12=1, A14=2, A15=3, B16=4, A18=5, B19=6, B21=7, A22=8, B24=9, B27=10. R(A) = 1+2+3+5+8 = 19. U(A) = 19 - 5×6/2 = 19 - 15 = 4. U(B) = 25 - 4 = 21. Smaller U = 4. For n1=n2=5, critical U at alpha=0.05 two-tailed is 2; our U=4 > 2, so fail to reject.
2. When would you choose a t-test over Mann-Whitney even with non-normal data?
When sample sizes are large enough (typically n ≥ 30 per group) that the central limit theorem makes the sampling distribution of the mean approximately normal regardless of the underlying distribution. Also when the research question is specifically about means (e.g., "by how many ms does the drug shorten reaction time") rather than stochastic dominance.
3. A study reports "U = 32, p = 0.02" with two groups of 12 each. Interpret.
The test rejects the null at alpha=0.05. (As a sanity check, the normal approximation gives z = (32 - 72) / sqrt(144×25/12) = -40 / 17.32 ≈ -2.31, which matches p ≈ 0.02.) The interpretation is that one group tends to produce higher ranks than the other: a random observation from one group is more likely to exceed a random observation from the other. To report effect size, use the rank-biserial correlation: r = 1 - 2U / (n1 × n2) = 1 - 64/144 ≈ 0.56, which is a large effect.
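
For samples as small as those in practice question 1, the table lookup can be replaced by a direct enumeration: under the null hypothesis every way of assigning 5 of the 10 ranks to group A is equally likely. The sketch below counts the assignments at least as extreme as the observed U. It assumes no ties (true for that data set) and uses illustrative variable names.

```python
from itertools import combinations

a = [12, 14, 15, 18, 22]
b = [16, 19, 21, 24, 27]
n1, n2 = len(a), len(b)

# No ties here, so rank = 1-based position in the sorted pool.
combined = sorted(a + b)
rank_of = {value: position for position, value in enumerate(combined, start=1)}
u_a = sum(rank_of[x] for x in a) - n1 * (n1 + 1) // 2
u_obs = min(u_a, n1 * n2 - u_a)

# Enumerate every possible set of ranks group A could have received.
total = 0
as_extreme = 0
for ranks in combinations(range(1, n1 + n2 + 1), n1):
    u1 = sum(ranks) - n1 * (n1 + 1) // 2
    total += 1
    if min(u1, n1 * n2 - u1) <= u_obs:
        as_extreme += 1

p_exact = as_extreme / total
print(f"U={u_obs}, exact two-sided p = {as_extreme}/{total} = {p_exact:.3f}")
```

The exact p-value is above 0.05, which agrees with the table-based conclusion of failing to reject.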

FAQs

Common questions about this topic

Is the Mann-Whitney U test the same as the Wilcoxon rank-sum test?
Yes. They are mathematically equivalent and were developed independently: Wilcoxon published in 1945, Mann and Whitney in 1947. Mann-Whitney works with the U statistic, Wilcoxon with the sum of ranks W. They produce identical p-values. Some software calls it "Mann-Whitney," some "Wilcoxon rank-sum," some "Mann-Whitney-Wilcoxon." A separate test called the Wilcoxon signed-rank test is for PAIRED data and is different.

Do the two groups need to be the same size?
No. n1 and n2 can differ. The U statistic accounts for unequal sample sizes because it is compared against n1 × n2, the total number of cross-group pairs. For very unequal samples (one tiny, one large), the normal approximation can be unreliable even when the larger group is well above n=10, so prefer the exact test in software when one group is very small.

How are ties handled?
When two or more observations have the same value, they receive the average of the ranks they would have occupied. If observations would have been ranks 5 and 6, both get 5.5. With many ties, the variance formula in the normal approximation needs a correction; most software handles this automatically. Heavy ties (more than 25% of observations tied) are a sign you might want a different test or to consider whether the outcome is truly ordinal.

Can I use Mann-Whitney for paired data?
No. Mann-Whitney is for two INDEPENDENT samples. For paired or repeated-measures data, use the Wilcoxon signed-rank test (the nonparametric paired t-test counterpart), which computes differences within pairs and signs/ranks those differences.

What effect size should I report with Mann-Whitney?
The rank-biserial correlation r = 1 - 2U / (n1 × n2) is a natural choice ranging from -1 to 1, interpretable as the difference between the proportion of pairs where group 1 exceeds group 2 and the proportion where group 2 exceeds group 1. Its size is commonly judged against Cohen's conventions for correlations: |r| around 0.1 is small, 0.3 is medium, 0.5 is large.
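
The pair-counting interpretation of the rank-biserial correlation can be sketched directly. The function name `rank_biserial` is made up for this example, and the sign convention (positive when the first group tends to be larger) is a choice made here; the formula with the smaller U gives the magnitude only.

```python
def rank_biserial(group1, group2):
    """Proportion of pairs where group1 wins minus proportion where group2 wins."""
    wins = sum(x > y for x in group1 for y in group2)
    losses = sum(x < y for x in group1 for y in group2)
    return (wins - losses) / (len(group1) * len(group2))


# Data from practice question 1: U = 4, so |r| = 1 - 2*4/25 = 0.68.
a = [12, 14, 15, 18, 22]
b = [16, 19, 21, 24, 27]
print(rank_biserial(a, b))  # negative: group B tends to be larger
```

Tied pairs count as neither a win nor a loss, which matches giving tied values average ranks in the U calculation.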

How does StatsIQ help with choosing between the tests?
Snap a photo of two columns of data. StatsIQ runs Shapiro-Wilk and produces Q-Q plots to check normality, flags outliers using IQR criteria, and reports both the t-test and Mann-Whitney results side by side with a recommendation. The narrative explains which test is more defensible given your data shape and sample sizes, and what effect size to report. This content is for educational purposes only.
