Intermediate–Advanced · 25 min

Effect Size Measures: Cohen's d, Eta-Squared, and Why P-Values Are Not Enough

A practical guide to effect size — what it measures, why statistical significance alone is misleading, how to calculate and interpret Cohen's d and eta-squared, and how reporting effect sizes makes your research more honest and more useful.

What You'll Learn

  • Explain why statistical significance without effect size is misleading and potentially harmful
  • Calculate and interpret Cohen's d for comparing two group means
  • Calculate and interpret eta-squared and partial eta-squared for ANOVA designs
  • Report effect sizes alongside p-values using APA and standard scientific formatting

1. The Problem with P-Values Alone

A p-value tells you whether an observed effect is unlikely to be due to chance alone. It does not tell you whether the effect is large enough to matter. This distinction is among the most important in applied statistics, and most introductory courses underemphasize it.

With a large enough sample, any difference, no matter how trivial, becomes statistically significant. A drug that lowers blood pressure by 0.5 mmHg will produce p < 0.001 if you test 50,000 people. The effect is statistically real, but 0.5 mmHg is clinically meaningless: no doctor would prescribe a drug for that benefit. Without knowing the effect size, you cannot tell whether the drug is a medical breakthrough or a waste of money. The p-value gives you the same small number either way.

The replication crisis in psychology and social science was partly driven by this problem. Researchers published findings that were statistically significant but had tiny effect sizes. When other labs tried to replicate them with smaller (more realistic) samples, the effects disappeared: not because they were fake, but because they were too small to detect without enormous samples. Reporting effect sizes alongside significance would have flagged them as trivially small from the start.

Effect size answers a different, and arguably more important, question than the p-value: how big is the effect? Is it a huge difference or a barely perceptible one? That information is essential for practical decision-making, whether you are a doctor deciding about a drug, a teacher evaluating a curriculum, or a business testing a new feature.
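The blood-pressure example can be sketched numerically. This is a minimal illustration, assuming a known SD of 10 mmHg and a simple two-sample z-test; the function name and the n = 50 comparison case are ours, not from any library:

```python
import math

def two_sample_z(diff, sd, n_per_group):
    """Two-sided p-value for a difference in means, assuming a known
    common SD and equal group sizes (large-sample z-test)."""
    se = sd * math.sqrt(2 / n_per_group)
    z = diff / se
    # two-sided p-value from the standard normal CDF
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Hypothetical drug: 0.5 mmHg drop, SD = 10 mmHg, two sample sizes
z_small, p_small = two_sample_z(0.5, 10, 50)        # n = 50 per group
z_large, p_large = two_sample_z(0.5, 10, 50_000)    # n = 50,000 per group
print(f"n = 50:     z = {z_small:.2f}, p = {p_small:.3f}")   # far from significant
print(f"n = 50,000: z = {z_large:.2f}, p = {p_large:.2e}")   # p < .001
# The effect size (d = 0.5 / 10 = 0.05) is identical in both runs;
# only the sample size changed the p-value.
```

The same trivial 0.5 mmHg difference is invisible at n = 50 and "highly significant" at n = 50,000, which is exactly the problem the section describes.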

Key Points

  • P-values tell you IF an effect exists. Effect sizes tell you HOW BIG the effect is. Both are needed.
  • With large samples, trivially small effects become statistically significant — p-value alone is misleading
  • The replication crisis was partly caused by publishing significant results with tiny, unreported effect sizes
  • Effect size is essential for practical decisions: is this difference big enough to act on?

2. Cohen's d: Comparing Two Group Means

Cohen's d is the most widely used effect size for comparing two group means (the independent-samples t-test scenario). It expresses the difference between the means in standard deviation units:

d = (M1 - M2) / s_pooled

where M1 and M2 are the group means and s_pooled is the pooled standard deviation, which combines the variability of both groups:

s_pooled = sqrt[((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2)]

Interpretation follows Cohen's (1988) benchmarks: d = 0.2 is a small effect (the distributions overlap by about 85%), d = 0.5 is a medium effect (about 67% overlap), and d = 0.8 is a large effect (about 53% overlap). These benchmarks are rough guidelines, not rigid thresholds. A d of 0.3 might be practically important in one context (a cheap intervention that affects millions) and trivial in another (an expensive treatment for a rare condition).

Worked example: a tutoring program is tested with n1 = n2 = 30. The tutored group scores M = 78, s = 12 on the final exam; the control group scores M = 72, s = 14. Pooled SD: s_pooled = sqrt[((29)(144) + (29)(196)) / 58] = sqrt[(4176 + 5684) / 58] = sqrt[9860 / 58] = sqrt[170] = 13.04. Cohen's d = (78 - 72) / 13.04 = 0.46, a medium effect: the tutoring program produces roughly half a standard deviation of improvement, which is meaningful in educational contexts.

StatsIQ generates practice problems that give you raw group data and ask you to calculate d, interpret its magnitude, and discuss whether the effect is practically meaningful in context.
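The worked example can be checked in a few lines. A minimal sketch; the `cohens_d` helper is illustrative, not a library function:

```python
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d for two independent groups, using the pooled SD."""
    s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / s_pooled

# Worked example from the text: tutored vs. control group, n = 30 each
d = cohens_d(78, 12, 30, 72, 14, 30)
print(f"d = {d:.2f}")  # medium effect by Cohen's benchmarks
```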

Key Points

  • Cohen's d = (M1 - M2) / pooled SD — expresses group differences in standard deviation units
  • Benchmarks: d = 0.2 (small), 0.5 (medium), 0.8 (large) — but always interpret in context
  • A d of 0.5 means the two distributions overlap by about 67%: substantial overlap, but a clearly visible difference between groups
  • Always report d alongside your t-test p-value — a significant t-test with d = 0.1 is statistically real but probably trivial

3. Eta-Squared and Partial Eta-Squared: Effect Size for ANOVA

When comparing more than two groups (ANOVA), Cohen's d does not apply directly because there are more than two means to compare. Instead, use eta-squared (η²) or partial eta-squared (η²p), which measure the proportion of variability explained by the factor.

Eta-squared: η² = SS_between / SS_total. This is the percentage of total variance in the dependent variable accounted for by group membership. If η² = 0.10, the grouping variable explains 10% of the variability in the outcome.

Partial eta-squared: η²p = SS_effect / (SS_effect + SS_error). This is used in factorial ANOVA designs with multiple factors. It tells you how much variance a factor explains relative to the variance left unexplained, after the other factors are accounted for. Partial eta-squared is what most statistical software (SPSS, R) reports by default for ANOVA effects.

Cohen's benchmarks for η²: 0.01 = small, 0.06 = medium, 0.14 = large. The same benchmarks are commonly applied to η²p, though partial eta-squared tends to give larger values than eta-squared in multi-factor designs.

Worked example: a one-way ANOVA testing three teaching methods yields SS_between = 450, SS_within = 2,250, SS_total = 2,700. η² = 450 / 2,700 = 0.167, a large effect: teaching method accounts for 16.7% of the variability in test scores. Other factors (prior knowledge, motivation, intelligence) still explain most of the score differences, but the teaching method itself has a substantial impact.

R-squared (R²) from regression analysis is conceptually the same quantity: the proportion of variance explained. In a simple linear regression, R² is the effect size, telling you how much of the outcome variability your predictor explains. An R² of 0.35 means the model explains 35% of the variation, a large effect by any standard.
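Both formulas are one-liners. A minimal sketch using the worked example's sums of squares (the function names are illustrative, not a library API):

```python
def eta_squared(ss_between, ss_total):
    """Eta-squared: proportion of total variance explained by the factor."""
    return ss_between / ss_total

def partial_eta_squared(ss_effect, ss_error):
    """Partial eta-squared: SS_effect / (SS_effect + SS_error)."""
    return ss_effect / (ss_effect + ss_error)

# Worked example from the text: three teaching methods
print(f"eta^2 = {eta_squared(450, 2700):.3f}")           # large effect
# In a one-way ANOVA, SS_error is just SS_within, so the
# partial and ordinary versions coincide:
print(f"partial eta^2 = {partial_eta_squared(450, 2250):.3f}")
```

The two values only diverge in factorial designs, where SS_error excludes the variance claimed by the other factors.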

Key Points

  • Eta-squared (η²) = SS_between / SS_total — proportion of total variance explained by the grouping variable
  • Partial eta-squared (η²p) is used in factorial designs — reported by default in most statistical software
  • Benchmarks for η²: 0.01 (small), 0.06 (medium), 0.14 (large)
  • R-squared from regression is conceptually identical to eta-squared — both measure proportion of variance explained

4. How to Report Effect Sizes: Practical Formatting

APA (7th edition) guidelines require effect sizes to be reported alongside all inferential test results. This is not a suggestion; it is a publication requirement for APA journals and expected in most statistics courses.

For a t-test (here with n = 60 per group): "The tutored group scored significantly higher (M = 78, SD = 12) than the control group (M = 72, SD = 14), t(118) = 2.52, p = .013, d = 0.46." Note that d is reported right alongside the p-value. The reader gets both pieces of information: the effect is statistically significant AND it is medium-sized.

For an ANOVA: "There was a significant effect of teaching method on test scores, F(2, 87) = 8.70, p < .001, η² = .167." The reader immediately sees that teaching method explains about 17% of score variability, a large and practically meaningful effect.

For a correlation: r itself IS an effect size. Cohen's benchmarks: r = .10 (small), r = .30 (medium), r = .50 (large). "Study hours and exam score were significantly correlated, r(48) = .42, p = .003." An r of .42 is a medium-to-large effect: knowing study hours predicts exam score moderately well.

The most important reporting practice: interpret the effect size in practical terms, not just statistical terms. Do not just say d = 0.46. Say the tutoring program improved scores by about half a standard deviation, which translates to roughly 6 points on a 100-point exam. Connecting the abstract number to a concrete, understandable quantity is what makes effect size reporting valuable to readers who are not statisticians.
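The formatting conventions above (two decimals for test statistics, leading zero dropped from p and η², "p < .001" for very small values) can be captured in small helpers. A sketch with hypothetical helper names and made-up demo values; this is not a library API:

```python
def format_p(p):
    """APA style: drop the leading zero; report 'p < .001' for tiny values."""
    return "p < .001" if p < 0.001 else "p = " + f"{p:.3f}"[1:]

def apa_t_report(t, df, p, d):
    """t-test result with Cohen's d, APA-style."""
    return f"t({df}) = {t:.2f}, {format_p(p)}, d = {d:.2f}"

def apa_anova_report(f_stat, df1, df2, p, eta_sq):
    """ANOVA result with eta-squared, APA-style."""
    return f"F({df1}, {df2}) = {f_stat:.2f}, {format_p(p)}, η² = " + f"{eta_sq:.3f}"[1:]

# Hypothetical demo values:
print(apa_t_report(2.10, 40, 0.042, 0.55))
print(apa_anova_report(8.70, 2, 87, 0.0004, 0.167))
```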

Key Points

  • APA 7th edition requires effect size reporting alongside all inferential test results
  • Format: t-test reports d, ANOVA reports η² or η²p, correlation reports r (which is itself an effect size)
  • Always interpret effect size in practical terms — half a standard deviation means X points on the actual scale used
  • r benchmarks: .10 (small), .30 (medium), .50 (large). These apply when r is used as an effect size for correlations.

Key Takeaways

  • P-values tell you IF an effect exists. Effect sizes tell you HOW BIG it is. Report both, always.
  • Cohen's d benchmarks: 0.2 (small), 0.5 (medium), 0.8 (large)
  • Eta-squared benchmarks: 0.01 (small), 0.06 (medium), 0.14 (large)
  • With large enough samples, any non-zero difference becomes significant — effect size prevents over-interpretation
  • APA 7th edition requires effect size reporting for all inferential statistics

Practice Questions

1. Group A: M = 50, SD = 8, n = 25. Group B: M = 45, SD = 10, n = 25. Calculate Cohen's d.
s_pooled = sqrt[((24)(64) + (24)(100)) / 48] = sqrt[(1536 + 2400) / 48] = sqrt[3936 / 48] = sqrt[82] = 9.06. d = (50 - 45) / 9.06 = 0.55. This is a medium effect size: the groups differ by about half a standard deviation.
2. An ANOVA yields F(3, 167) = 4.51, p = .005. SS_between = 270, SS_total = 3,600. Calculate and interpret eta-squared.
η² = 270 / 3,600 = 0.075. This is a medium effect size: the grouping variable explains 7.5% of the total variability. The effect is statistically significant (p = .005) and practically meaningful, though most of the variability is explained by other factors.
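Both practice answers can be verified with a short stdlib-only script:

```python
import math

# Q1: Cohen's d from the two group summaries (M, SD, n)
s_pooled = math.sqrt((24 * 8**2 + 24 * 10**2) / 48)   # pooled SD, sqrt(82)
d = (50 - 45) / s_pooled
print(f"d = {d:.2f}")           # medium effect

# Q2: eta-squared from the ANOVA sums of squares
eta_sq = 270 / 3600
print(f"eta^2 = {eta_sq:.3f}")  # medium effect
```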


FAQs

Common questions about this topic

Which effect size should I report for which test?

Match the effect size to the test. t-test = Cohen's d. ANOVA = eta-squared or partial eta-squared. Correlation = r. Regression = R-squared. Chi-square = Cramér's V. If in doubt, Cohen's d and R-squared are the most universally understood and widely reported.

Can StatsIQ help me practice effect size calculations?

Yes. StatsIQ generates problems that require calculating Cohen's d, eta-squared, and R-squared from raw data, interpreting magnitudes using Cohen's benchmarks, and reporting results in proper APA format.
