Mann-Whitney U Test vs t-Test: When to Use Which (Worked Examples)
A practical comparison of the Mann-Whitney U test (Wilcoxon rank-sum) and the independent-samples t-test: when normality assumptions justify the t-test, when ranks are the safer choice, and how to compute and interpret both with two worked examples.
What You'll Learn
- Decide between the Mann-Whitney U test and the independent-samples t-test based on data structure.
- Compute the U statistic from ranks and interpret the resulting p-value.
- Quantify the power cost of using ranks when the t-test's assumptions actually hold.
1. Direct Answer: When to Use Mann-Whitney Instead of a t-Test
Use the Mann-Whitney U test (also called the Wilcoxon rank-sum test) when comparing two independent groups and you cannot safely assume normality, typically because the sample size is small, the data are ordinal (Likert ratings, ranks), or there are clear outliers that distort the mean. The t-test assumes the sampling distribution of the difference in means is approximately normal. By the central limit theorem, this holds approximately for continuous data with finite variance once samples are large (a common rule of thumb is n ≥ 30 per group). When samples are small AND the population distribution is heavily skewed or contains outliers, the t-test's p-value can be misleading; Mann-Whitney works on ranks and is robust to those issues. The price: when the t-test's assumptions actually hold, Mann-Whitney is roughly 95% as efficient (it needs about 5% more observations to reach the same power), so reach for it only when you have a real reason to doubt normality.
Key Points
- Mann-Whitney is preferred for small samples with non-normal distributions or ordinal data.
- The t-test wins on power when its normality assumption is reasonable.
- For large samples (n ≥ 30 per group), the central limit theorem makes the t-test robust to non-normality.
2. How the U Statistic Is Computed
Combine both groups into one list and rank from smallest to largest. Assign tied values the average of the ranks they span. Sum the ranks for group 1 and call it R1. Compute U1 = R1 - n1(n1+1)/2, where n1 is the size of group 1. Compute U2 similarly for group 2; U1 + U2 = n1 × n2 always. Use the smaller of U1 and U2 as the test statistic U. For small samples, look it up in a Mann-Whitney table; for n1, n2 ≥ 10, use the normal approximation: z = (U - n1×n2/2) / sqrt(n1×n2×(n1+n2+1)/12). The intuition: if the groups are exchangeable, the ranks shuffle randomly between them and U falls near the middle, n1×n2/2. If group 1 is systematically larger, the high ranks pile up in group 1 and U deviates from the middle.
Key Points
- Combine groups, rank everything, sum the ranks for one group (R1).
- U1 = R1 - n1(n1+1)/2. Use the smaller of U1 and U2.
- For n1, n2 ≥ 10, normal approximation: z = (U - n1×n2/2) / sqrt(n1×n2×(n1+n2+1)/12).
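The ranking and U computation described in this section can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a replacement for a statistics package: the function names `average_ranks` and `mann_whitney_u` and the sample data are made up for this example, and the normal approximation here omits any tie correction.

```python
import math
from statistics import NormalDist

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(group1, group2):
    """Return (U1, U2, z, two-sided p) via the normal approximation, no tie correction."""
    n1, n2 = len(group1), len(group2)
    ranks = average_ranks(list(group1) + list(group2))
    r1 = sum(ranks[:n1])                 # rank sum R1 for group 1
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1                    # because U1 + U2 = n1 * n2
    u = min(u1, u2)
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - n1 * n2 / 2) / sd           # z <= 0 because u is the smaller U
    p = 2 * NormalDist().cdf(z)
    return u1, u2, z, p

# Hypothetical data, for illustration only.
control = [12, 15, 11, 18, 14, 13, 16, 17, 10, 19]
treated = [20, 22, 15, 24, 21, 18, 25, 23, 19, 26]
u1, u2, z, p = mann_whitney_u(control, treated)
print(f"U1={u1}, U2={u2}, z={z:.2f}, p={p:.4f}")
```

Returning both U1 and U2 rather than only the smaller one keeps the direction of the difference visible, which the smaller U alone hides.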
3. Worked Example 1: Reaction Times Under Two Conditions
Eight subjects per condition. Reaction times (ms) for condition A: 245, 252, 261, 273, 287, 295, 312, 480. Condition B: 268, 271, 274, 282, 291, 298, 305, 318. Notice the outlier in A (480) that would distort the mean. Combine and rank from smallest: A245=1, A252=2, A261=3, B268=4, B271=5, A273=6, B274=7, B282=8, A287=9, B291=10, A295=11, B298=12, B305=13, A312=14, B318=15, A480=16. R(A) = 1+2+3+6+9+11+14+16 = 62. U(A) = 62 - 8×9/2 = 62 - 36 = 26. U(B) = 8×8 - 26 = 38. Smaller U = 26. Normal approximation: z = (26 - 32) / sqrt(64×17/12) = -6 / sqrt(90.67) = -6 / 9.52 = -0.63. Two-tailed p ≈ 0.53. (With n = 8 per group, below the n ≥ 10 guideline, an exact table or software p-value is preferable; the approximation is used here to show the arithmetic.) Fail to reject: no evidence the conditions differ. A naive t-test on these same data is distorted by the 480 ms outlier. That one value drags A's mean up to 300.6 ms, above B's 288.4 ms, even though A's median (280 ms) is below B's (286.5 ms), and it inflates A's standard deviation. The resulting t statistic (about 0.44, two-tailed p roughly 0.67) reflects the outlier rather than the bulk of the data. This is exactly the situation Mann-Whitney was designed for.
Key Points
- Outliers distort the t-test by shifting the mean and inflating the standard deviation; Mann-Whitney is far less affected because it uses ranks.
- The rank-sum approach handles the asymmetry created by extreme values gracefully.
- Report both means and medians when an outlier could swing your conclusion.
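To see how much a single value moves the mean-based comparison, the sketch below computes means, medians, and a Welch-style t statistic for the reaction-time data, with and without the 480 ms observation. The helper `welch_t` is written here for illustration. Dropping the outlier is done only as a sensitivity check, not as a recommendation.

```python
import math
from statistics import mean, median, variance

cond_a = [245, 252, 261, 273, 287, 295, 312, 480]
cond_b = [268, 271, 274, 282, 291, 298, 305, 318]

def welch_t(x, y):
    """Welch's t statistic: mean difference divided by its unpooled standard error."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

print("means  :", mean(cond_a), mean(cond_b))      # 300.625 vs 288.375
print("medians:", median(cond_a), median(cond_b))  # 280.0 vs 286.5

t_with = welch_t(cond_a, cond_b)          # all 16 observations
t_without = welch_t(cond_a[:-1], cond_b)  # sensitivity check: 480 ms removed
print(f"t with outlier: {t_with:.2f}, t without outlier: {t_without:.2f}")
```

The sign of the t statistic flips when the one extreme value is removed, while the ranks of the other fifteen observations are unchanged. That instability is what the rank-based test avoids.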
4. Worked Example 2: Likert Ratings (1-7) Between Two Groups
You compare satisfaction ratings (1-7 Likert) for two product designs. Group A (n=10): 5, 6, 4, 7, 5, 6, 5, 6, 7, 6. Group B (n=10): 3, 4, 5, 4, 3, 5, 4, 5, 4, 5. Likert data is ordinal: the distance between "5" and "6" is not guaranteed to equal the distance between "3" and "4", so a t-test on the raw scores assumes more than the measurement actually justifies. Combine and rank, with ties getting average ranks: the two 3s share ranks 1-2 (average 1.5), the five 4s share ranks 3-7 (average 5), the seven 5s share ranks 8-14 (average 11), the four 6s share ranks 15-18 (average 16.5), and the two 7s share ranks 19-20 (average 19.5). Summing, R(A) = 5 + 3×11 + 4×16.5 + 2×19.5 = 143 and R(B) = 2×1.5 + 4×5 + 4×11 = 67. U(A) = 143 - 55 = 88. U(B) = 100 - 88 = 12. Smaller U = 12. z = (12 - 50) / sqrt(100 × 21 / 12) = -38 / sqrt(175) = -38 / 13.23 = -2.87. Two-tailed p ≈ 0.004. (The tie correction that software applies to the variance makes z slightly larger in magnitude.) Reject: group A's ratings are significantly higher. The interpretation is in terms of stochastic dominance: a random observation from group A is more likely to be higher than a random observation from group B, not "the means differ by X units."
Key Points
- Likert data is ordinal; Mann-Whitney respects the measurement scale.
- Conclusion is about stochastic dominance, not a mean difference.
- Tied observations get the average of the ranks they would have occupied.
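With this many ties, hand-ranking is error-prone, so recomputing the rank sums directly from the ratings listed above is a useful check. The sketch below assigns each distinct rating the average of the ranks it spans, then computes the rank sums, U, and the uncorrected normal approximation. Variable names are chosen for this example only.

```python
import math
from collections import Counter
from statistics import NormalDist

group_a = [5, 6, 4, 7, 5, 6, 5, 6, 7, 6]
group_b = [3, 4, 5, 4, 3, 5, 4, 5, 4, 5]

# Map each distinct rating to the average of the ranks it spans.
counts = Counter(group_a + group_b)
avg_rank = {}
next_rank = 1
for rating in sorted(counts):
    t = counts[rating]                        # number of observations tied at this rating
    avg_rank[rating] = next_rank + (t - 1) / 2
    next_rank += t

r_a = sum(avg_rank[x] for x in group_a)       # rank sum for group A
r_b = sum(avg_rank[x] for x in group_b)       # rank sum for group B
n_a, n_b = len(group_a), len(group_b)
u_a = r_a - n_a * (n_a + 1) / 2
u_b = n_a * n_b - u_a
u = min(u_a, u_b)

# Normal approximation without tie correction.
z = (u - n_a * n_b / 2) / math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
p = 2 * NormalDist().cdf(z)

print(avg_rank)  # {3: 1.5, 4: 5.0, 5: 11.0, 6: 16.5, 7: 19.5}
print(f"R(A)={r_a}, R(B)={r_b}, U={u}, z={z:.2f}, p={p:.4f}")
```

A quick sanity check on any hand calculation is that the two rank sums must add up to N(N+1)/2, which is 210 for 20 observations.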
5. Power Trade-off: The Cost of Using Ranks
When the t-test's assumptions actually hold (normal populations, similar variances), Mann-Whitney has an asymptotic relative efficiency of 0.955, meaning you need about 5% more observations to achieve the same power as the t-test. That cost is small, and many practitioners use Mann-Whitney as the default for two-group comparisons in fields where data are reliably non-normal (clinical trials with skewed lab values, ecology counts, satisfaction surveys). When distributions are heavy-tailed or contain outliers, Mann-Whitney can be substantially MORE powerful than the t-test because the t-test's power is throttled by the outliers inflating the standard deviation. Here is the surprising part most courses leave out: against truly contaminated or fat-tailed distributions, Mann-Whitney can be 1.5 to 3 times as efficient as the t-test (its asymptotic relative efficiency is 1.5 for a double-exponential distribution and 3 for an exponential one). It is not always the conservative choice.
Key Points
- Mann-Whitney has 95.5% relative efficiency vs the t-test under normality.
- Against heavy-tailed or contaminated data, Mann-Whitney can be MORE powerful.
- Default to Mann-Whitney when distributional assumptions are doubtful.
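The efficiency figures can be reproduced numerically. The sketch below assumes the standard large-sample formula for the efficiency of the rank-sum test relative to the t-test under a location shift, ARE = 12 × σ² × (∫ f(x)² dx)², which is not stated in the text above. It evaluates the formula with a simple midpoint rule for three densities. The function name and integration limits are choices made for this illustration.

```python
import math

def are_rank_sum_vs_t(pdf, variance, lo, hi, steps=100_000):
    """Evaluate 12 * sigma^2 * (integral of pdf^2)^2 with a midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    integral = sum(pdf(lo + (k + 0.5) * h) ** 2 for k in range(steps)) * h
    return 12 * variance * integral ** 2

def normal_pdf(x):        # standard normal, variance 1
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def laplace_pdf(x):       # double exponential with scale 1, variance 2
    return 0.5 * math.exp(-abs(x))

def exponential_pdf(x):   # rate 1, variance 1, defined on x >= 0
    return math.exp(-x)

are_normal = are_rank_sum_vs_t(normal_pdf, 1.0, -12, 12)
are_laplace = are_rank_sum_vs_t(laplace_pdf, 2.0, -40, 40)
are_exponential = are_rank_sum_vs_t(exponential_pdf, 1.0, 0, 40)

print(f"normal: {are_normal:.3f}")            # 3/pi, about 0.955
print(f"Laplace: {are_laplace:.3f}")          # 1.5
print(f"exponential: {are_exponential:.3f}")  # 3.0
```

These are asymptotic ratios of required sample sizes, so actual power at small n will differ somewhat. They still show why the cost under normality is small and the gain under heavy tails can be large.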
6. Common Decision Tree
Two independent groups, continuous or ordinal outcome. (1) Is the outcome ordinal (Likert, ranks, ordered categories)? → Mann-Whitney. (2) Is n ≥ 30 per group and the data continuous? → The t-test is robust to non-normality by the CLT. (3) Is n < 30 per group? Check normality with a histogram or Shapiro-Wilk test. If normality is plausible, use the t-test. If not, use Mann-Whitney. (4) Are there outliers? Decide whether they are data-entry errors (drop them and use the t-test) or genuine observations (use Mann-Whitney to keep them without distortion). (5) Are variances roughly equal? If not, Welch's t-test (or Mann-Whitney) handles unequal variances better than the pooled t-test.
Key Points
- Ordinal outcome → Mann-Whitney.
- Large sample continuous outcome → t-test (CLT to the rescue).
- Outliers you cannot defensibly drop → Mann-Whitney.
- Unequal variances → Welch's t-test or Mann-Whitney.
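The decision tree can be written down as a small function, which makes the order of the questions explicit. This is one possible encoding, not a definitive rule: the function name and parameters are hypothetical, the inputs are judgments you supply after looking at the data, and checking for genuine outliers before sample size is an assumption of this sketch.

```python
def choose_test(ordinal, n_per_group, normality_plausible=True,
                genuine_outliers=False, variances_similar=True):
    """Suggest a two-group test from the five questions in the decision tree.

    n_per_group is the smaller of the two group sizes. All other arguments
    are judgments made by the analyst (hypothetical inputs for this sketch).
    """
    if ordinal:                                       # question 1
        return "Mann-Whitney U"
    if genuine_outliers:                              # question 4, checked early by assumption
        return "Mann-Whitney U"
    if n_per_group < 30 and not normality_plausible:  # questions 2 and 3
        return "Mann-Whitney U"
    if not variances_similar:                         # question 5
        return "Welch's t-test"
    return "pooled t-test"

print(choose_test(ordinal=True, n_per_group=10))
print(choose_test(ordinal=False, n_per_group=12, normality_plausible=False))
print(choose_test(ordinal=False, n_per_group=50, variances_similar=False))
```

Treat the output as a starting point for a justification, not a substitute for one. The thresholds, such as 30 per group, are rules of thumb.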
7. Running Either Test in StatsIQ
Snap a photo of the two columns of data and StatsIQ runs both tests automatically, presenting the t-test result, Welch's t-test result, and Mann-Whitney U result side by side along with normality diagnostics (Shapiro-Wilk, Q-Q plot, histogram) so you can defend your choice. The app calls out outliers, suggests transformations, and explains which test is most defensible given your sample size and data shape.
Key Points
- StatsIQ runs t-test, Welch's t-test, and Mann-Whitney in parallel for comparison.
- Normality diagnostics are produced automatically.
- Outlier flags and transformation suggestions appear before the p-value.
Key Takeaways
- Mann-Whitney = Wilcoxon rank-sum: nonparametric counterpart to the independent-samples t-test.
- Compares two independent groups using ranks instead of raw values.
- Asymptotic relative efficiency vs t-test under normality: 0.955.
- Robust to outliers and applicable to ordinal data.
- Test statistic U for small samples; normal approximation for n ≥ 10 per group.
Practice Questions
1. Group A (n=5): 12, 14, 15, 18, 22. Group B (n=5): 16, 19, 21, 24, 27. Compute Mann-Whitney U.
2. When would you choose a t-test over Mann-Whitney even with non-normal data?
3. A study reports "U = 47, p = 0.02" with two groups of 12 each. Interpret.
FAQs
Common questions about this topic
The Mann-Whitney U test and the Wilcoxon rank-sum test are the same test: they are mathematically equivalent and were developed independently. Wilcoxon published in 1945, Mann and Whitney in 1947. Mann-Whitney works with the U statistic, Wilcoxon with the sum of ranks W. They produce identical p-values. Some software calls it "Mann-Whitney," some "Wilcoxon rank-sum," some "Mann-Whitney-Wilcoxon." A separate test called the Wilcoxon signed-rank test is for PAIRED data and is different.
The two groups do not need to be the same size; n1 and n2 can differ. The U statistic naturally accounts for unequal sample sizes through the n1 × n2 normalization. For very unequal samples (one tiny, one large), small-sample tables become relevant earlier than n=10, so prefer the exact test in software when one group is very small.
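For small or very unequal groups, the exact p-value can be obtained by brute force: enumerate every way the pooled ranks could be split between the groups and count how often U is at least as extreme as observed. The sketch below is a minimal version that assumes no tied values. The function name and data are hypothetical, and statistical software uses faster algorithms to reach the same answer.

```python
from itertools import combinations

def exact_two_sided_p(group1, group2):
    """Exact two-sided Mann-Whitney p-value by enumerating all rank assignments.

    Assumes no tied values, so the pooled ranks are simply 1..N.
    """
    n1, n2 = len(group1), len(group2)
    pooled = sorted(list(group1) + list(group2))
    if len(set(pooled)) != len(pooled):
        raise ValueError("this sketch does not handle ties")
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    offset = n1 * (n1 + 1) / 2

    def smaller_u(rank_sum):
        u1 = rank_sum - offset
        return min(u1, n1 * n2 - u1)

    observed = smaller_u(sum(rank[v] for v in group1))
    total = extreme = 0
    # Every subset of n1 ranks is equally likely under the null hypothesis.
    for subset in combinations(range(1, n1 + n2 + 1), n1):
        total += 1
        if smaller_u(sum(subset)) <= observed:
            extreme += 1
    return extreme / total

# Hypothetical, very unequal groups (n1 = 3, n2 = 12).
small = [3.2, 4.1, 2.7]
large = [5.0, 4.8, 6.3, 5.9, 4.4, 7.1, 5.5, 6.0, 4.9, 6.6, 5.2, 5.8]
p_small = exact_two_sided_p(small, large)
print(f"exact two-sided p = {p_small:.4f}")  # 2/455, about 0.0044
```

Enumeration grows quickly, since two groups of 8 already give 12,870 splits, so this approach is only practical for the small samples where the exact test matters most.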
When two or more observations have the same value, they receive the average of the ranks they would have occupied. If observations would have been ranks 5 and 6, both get 5.5. With many ties, the variance formula in the normal approximation needs a correction; most software handles this automatically. Heavy ties (more than 25% of observations tied) are a sign you might want a different test or to consider whether the outcome is truly ordinal.
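The tie correction mentioned here can be sketched as follows. It assumes the commonly used form of the corrected variance, n1×n2/12 × [(N+1) - Σ(t³ - t) / (N(N-1))], where N = n1 + n2 and t runs over the sizes of the tied groups. The function name is made up for this example, and the data are the Likert ratings from Worked Example 2.

```python
import math
from collections import Counter

def tie_corrected_sd(group1, group2):
    """Standard deviation of U under the null hypothesis, corrected for ties."""
    n1, n2 = len(group1), len(group2)
    n = n1 + n2
    tie_sizes = Counter(list(group1) + list(group2)).values()
    tie_term = sum(t ** 3 - t for t in tie_sizes)  # zero when there are no ties
    return math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))

group_a = [5, 6, 4, 7, 5, 6, 5, 6, 7, 6]
group_b = [3, 4, 5, 4, 3, 5, 4, 5, 4, 5]

sd_plain = math.sqrt(10 * 10 * 21 / 12)        # uncorrected, about 13.23
sd_ties = tie_corrected_sd(group_a, group_b)   # corrected, about 12.78
print(f"uncorrected sd = {sd_plain:.2f}, tie-corrected sd = {sd_ties:.2f}")
```

Ties shrink the variance of U, so the corrected z is larger in magnitude than the uncorrected one. With few ties the difference is negligible, but with coarse scales like this one it is noticeable.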
Mann-Whitney is not appropriate for paired data; it is for two INDEPENDENT samples. For paired or repeated-measures data, use the Wilcoxon signed-rank test (the nonparametric counterpart to the paired t-test), which computes differences within pairs and then signs and ranks those differences.
For an effect size, the rank-biserial correlation r = 1 - 2U / (n1 × n2) is a natural choice. Computed from the smaller U, it gives the magnitude of the effect, from 0 to 1. Computed as (U1 - U2) / (n1 × n2), it is signed, ranges from -1 to 1, and is interpretable as the difference between the proportion of pairs where group 1 exceeds group 2 and the proportion where group 2 exceeds group 1. It is often read against Cohen's conventions for correlations: |r| around 0.1 is small, 0.3 is medium, 0.5 is large.
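The interpretation as a difference of proportions can be checked by counting pairs directly, without computing ranks at all. The sketch below does this for the Likert ratings from Worked Example 2. The function name is chosen for this illustration, and tied pairs count toward neither group.

```python
def rank_biserial(group1, group2):
    """Signed rank-biserial correlation by direct pair counting."""
    wins = sum(1 for a in group1 for b in group2 if a > b)    # group 1 higher
    losses = sum(1 for a in group1 for b in group2 if a < b)  # group 2 higher
    return (wins - losses) / (len(group1) * len(group2))

group_a = [5, 6, 4, 7, 5, 6, 5, 6, 7, 6]
group_b = [3, 4, 5, 4, 3, 5, 4, 5, 4, 5]

r = rank_biserial(group_a, group_b)
print(f"rank-biserial r = {r:.2f}")  # 0.76: A is higher in 80 pairs, B in 4, 16 tied
```

Swapping the argument order flips the sign, so state which group is "group 1" when reporting the value.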
In StatsIQ, snap a photo of two columns of data. The app runs Shapiro-Wilk and produces Q-Q plots to check normality, flags outliers using IQR criteria, and reports both the t-test and Mann-Whitney results side by side with a recommendation. The narrative explains which test is more defensible given your data shape and sample sizes, and what effect size to report.