Intermediate–Advanced · 25 min

Cohen's d Effect Size: Formula, Interpretation, and When to Report It

A complete guide to Cohen's d — covering the formula, how to calculate it from typical study data, the standard interpretation thresholds, and why effect sizes are essential alongside p-values in modern statistical reporting.

What You'll Learn

  • Define Cohen's d and explain what it measures
  • Calculate Cohen's d from means, standard deviations, and sample sizes
  • Interpret Cohen's d values using standard thresholds (small, medium, large)
  • Explain why effect sizes should be reported alongside p-values

1. The Direct Answer: Cohen's d Measures Effect Size as Standardized Mean Difference

Cohen's d is a standardized effect size measure that expresses the difference between two group means in units of standard deviations. It tells you how LARGE an effect is, which is different from whether the effect is statistically significant (p-value).

**Formula**: d = (M₁ - M₂) / SD_pooled

Where:
  • M₁ and M₂ are the means of the two groups
  • SD_pooled is a pooled estimate of the within-group standard deviation

**Pooled standard deviation**:

SD_pooled = √[((n₁ - 1) × SD₁² + (n₂ - 1) × SD₂²) / (n₁ + n₂ - 2)]

Where n₁ and n₂ are the sample sizes and SD₁ and SD₂ are the standard deviations of the two groups.

**Standard interpretation thresholds (Cohen, 1988)**:
  • **d = 0.2**: small effect
  • **d = 0.5**: medium effect
  • **d = 0.8**: large effect
  • **d ≥ 1.2**: very large effect (a later extension of Cohen's original three benchmarks)

**Why effect sizes matter**: A small p-value (say, p < 0.001) tells you the effect is unlikely to be due to chance — it's statistically significant. But it doesn't tell you whether the effect is IMPORTANT. Two extreme cases:

1. **Small effect, large sample**: with n = 10,000 per group, even a tiny difference (d = 0.05) can achieve p < 0.001. Statistically significant but PRACTICALLY TRIVIAL.
2. **Large effect, small sample**: with n = 10 per group, a substantial difference (d = 0.8) might not achieve statistical significance (p ≈ 0.09). Statistically non-significant but PRACTICALLY MEANINGFUL.

Both can be misleading if reported only by p-value. Effect sizes cut through the ambiguity by describing the magnitude directly.

Snap a photo of any effect size problem and StatsIQ calculates Cohen's d, interprets the magnitude, and compares it to the p-value analysis. This content is for educational purposes only.
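As a quick illustration, the formula and thresholds above can be computed in a few lines of Python (a minimal sketch; the function names here are ours, not from any particular library):

```python
import math

def cohens_d(m1, m2, sd1, sd2, n1, n2):
    """Cohen's d for two independent groups, using the pooled SD."""
    sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2)
                          / (n1 + n2 - 2))
    return (m1 - m2) / sd_pooled

def interpret(d):
    """Map |d| onto the conventional benchmarks."""
    size = abs(d)
    if size >= 1.2:
        return "very large"
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"

# Treatment (M=75, SD=12, n=50) vs. control (M=70, SD=10, n=50)
d = cohens_d(75, 70, 12, 10, 50, 50)
print(round(d, 3), interpret(d))  # 0.453 small
```

Note that a d of 0.453 falls just below the 0.5 "medium" benchmark, which is why the thresholds are best treated as rough guides rather than sharp categories.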

Key Points

  • Cohen's d = (M₁ - M₂) / SD_pooled. Standardized difference between group means.
  • Interpretation: 0.2 small, 0.5 medium, 0.8 large, 1.2+ very large.
  • Independent of sample size (unlike p-values). Describes magnitude, not significance.
  • Should be reported alongside p-value in modern research. APA style requires effect sizes.

2. Calculating Cohen's d: Step-by-Step Worked Examples

Let's work through the calculation for common scenarios.

**Example 1: Basic calculation from means and SDs**

Group 1 (treatment): n₁ = 50, M₁ = 75, SD₁ = 12
Group 2 (control): n₂ = 50, M₂ = 70, SD₂ = 10

Step 1: Calculate pooled SD.
SD_pooled = √[((50-1) × 12² + (50-1) × 10²) / (50+50-2)] = √[(49 × 144 + 49 × 100) / 98] = √[(7,056 + 4,900) / 98] = √[11,956 / 98] = √122 = 11.05

Step 2: Calculate d.
d = (75 - 70) / 11.05 = 5 / 11.05 = 0.453

**Interpretation**: d ≈ 0.45 is a medium-small effect. The treatment group scored about half a standard deviation higher than the control group.

**Example 2: Equal sample sizes, different standard deviations**

Group 1: n₁ = 30, M₁ = 100, SD₁ = 15
Group 2: n₂ = 30, M₂ = 112, SD₂ = 18

Step 1: Pooled SD.
SD_pooled = √[((30-1) × 15² + (30-1) × 18²) / (30+30-2)] = √[(29 × 225 + 29 × 324) / 58] = √[(6,525 + 9,396) / 58] = √[15,921 / 58] = √274.50 = 16.57

Step 2: Calculate d.
d = (100 - 112) / 16.57 = -12 / 16.57 = -0.724

The negative sign indicates Group 1 is LOWER than Group 2. In absolute terms, |d| = 0.72 is approaching a large effect.

**Example 3: Different sample sizes**

Group 1: n₁ = 100, M₁ = 55, SD₁ = 10
Group 2: n₂ = 40, M₂ = 60, SD₂ = 12

Step 1: Pooled SD.
SD_pooled = √[((100-1) × 10² + (40-1) × 12²) / (100+40-2)] = √[(99 × 100 + 39 × 144) / 138] = √[(9,900 + 5,616) / 138] = √[15,516 / 138] = √112.43 = 10.60

Step 2: Calculate d.
d = (55 - 60) / 10.60 = -5 / 10.60 = -0.472

Absolute |d| = 0.47 is a medium effect.

**Example 4: Paired (repeated measures) data**

For within-subjects designs (pre-post), use: d = M_difference / SD_difference, where M_difference is the mean of the difference scores and SD_difference is the standard deviation of the difference scores.

Pre-test: M = 70, SD = 10
Post-test: M = 80, SD = 12
Mean of differences: 10
SD of differences: 8
d = 10 / 8 = 1.25 (very large effect)

**Example 5: From a t-statistic**

If you have a t-statistic from an independent samples t-test: d = t × √(1/n₁ + 1/n₂). For equal group sizes this is approximately d = 2t / √(N), where N = n₁ + n₂ is the total sample size.

Example: t = 3.0, n₁ = n₂ = 25. d = 3.0 × √(1/25 + 1/25) = 3.0 × √0.08 = 3.0 × 0.283 = 0.849. Large effect.

**Rough conversion from correlation coefficient r**: d = 2r / √(1 - r²). Useful when you have r but want to report d.

StatsIQ calculates d from any combination of inputs (means + SDs, t-statistic, or correlation) and applies the correct formula based on study design.
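The worked examples above can be verified in code. Here is a small sketch of the three formulas used in this section (the helper names are ours):

```python
import math

def pooled_sd(sd1, sd2, n1, n2):
    """Pooled within-group standard deviation."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2)
                     / (n1 + n2 - 2))

def d_independent(m1, m2, sd1, sd2, n1, n2):
    """Cohen's d for two independent groups."""
    return (m1 - m2) / pooled_sd(sd1, sd2, n1, n2)

def d_paired(mean_diff, sd_diff):
    """Cohen's d for paired (pre-post) designs."""
    return mean_diff / sd_diff

def d_from_t(t, n1, n2):
    """Cohen's d recovered from an independent-samples t-statistic."""
    return t * math.sqrt(1 / n1 + 1 / n2)

print(round(d_independent(75, 70, 12, 10, 50, 50), 3))    # Example 1: 0.453
print(round(d_independent(100, 112, 15, 18, 30, 30), 3))  # Example 2: -0.724
print(round(d_independent(55, 60, 10, 12, 100, 40), 3))   # Example 3: -0.472
print(round(d_paired(10, 8), 2))                          # Example 4: 1.25
print(round(d_from_t(3.0, 25, 25), 3))                    # Example 5: 0.849
```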

Key Points

  • Independent samples: SD_pooled = √[((n₁-1)SD₁² + (n₂-1)SD₂²) / (n₁+n₂-2)].
  • Paired samples: d = mean difference / SD of differences.
  • From t: d = t × √(1/n₁ + 1/n₂). Quick conversion.
  • Sign of d indicates direction (positive vs negative). Magnitude is what matters for interpretation.

3. Why p-Values Alone Are Misleading

Understanding why effect sizes matter requires understanding the limitations of p-values.

**The fundamental problem with p-values**: The p-value depends on BOTH the effect size AND the sample size. A large sample can make a tiny effect look 'significant.' A small sample can make a huge effect look 'non-significant.' Without effect size, you can't distinguish the two.

**Demonstration**:

**Scenario A**: Tiny effect, large sample.
  • M₁ = 100.5, M₂ = 100, SD = 10, n = 10,000 per group
  • Difference: 0.5 points (trivial)
  • d = 0.5/10 = 0.05 (extremely small effect)
  • t ≈ 3.5, p ≈ 0.0005 (highly significant)

The result is 'statistically significant' but practically meaningless. 0.5 points on a 100-point scale is within measurement error and has no real-world implications.

**Scenario B**: Large effect, small sample.
  • M₁ = 80, M₂ = 60, SD = 15, n = 5 per group
  • Difference: 20 points (huge)
  • d = 20/15 = 1.33 (very large effect)
  • t ≈ 2.1, p ≈ 0.07 (not significant at 0.05)

The result is 'not statistically significant' but represents a massive real effect. With only 5 per group, power is too low to detect even large effects reliably.

**If you only reported p-values**:
  • Scenario A would be reported as a 'significant finding' and the paper might get published.
  • Scenario B would be reported as a 'non-finding' and the paper might be rejected.

This is the OPPOSITE of the truth. Scenario B is the more important finding.

**The statistical reform movement**: Since the 2000s, there has been a major movement to improve statistical reporting:

1. **APA Publication Manual (6th and 7th editions)**: requires effect sizes alongside p-values in most analyses.
2. **Medical journals (NEJM, JAMA, BMJ, Lancet)**: require effect sizes and confidence intervals.
3. **Psychology reform (post-replication crisis)**: preregistration, replication studies, effect size reporting standards.
4. **ASA statement on p-values (2016)**: cautions against treating p < 0.05 as a magical threshold.
**Best practice for modern reporting**: Always report:

1. **The effect size** (d, r, η², odds ratio, etc. — appropriate for your analysis)
2. **The 95% confidence interval** for the effect size
3. **The p-value** (if desired, but not as the primary result)
4. **The sample size**
5. **The observed means, SDs, and correlations** (enough to allow others to recalculate)

Example sentence from a modern paper: 'The intervention group (M = 75, SD = 12) scored higher than the control group (M = 70, SD = 10); t(98) = 2.26, p = .026, d = 0.453 [95% CI 0.054, 0.852].' The d and CI tell readers the size and precision of the effect, not just whether it exceeded an arbitrary threshold.

**Minimum meaningful effect size**: In applied fields, researchers often pre-specify a 'minimum meaningful effect' based on practical significance:
  • **Education**: d = 0.25 is often considered the minimum educationally meaningful difference
  • **Medicine**: effect sizes depend on outcome (small effect on mortality is huge; large effect on minor symptom is modest)
  • **Psychology**: d = 0.50 is often used as a practical threshold
  • **Economics**: depends heavily on context — a 1% effect on GDP is massive, on individual income is modest

Effect sizes below the minimum meaningful threshold, even if statistically significant, may not justify intervention.

StatsIQ computes effect sizes alongside traditional statistical tests and flags cases where p-values and effect sizes tell different stories.
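The Scenario A/B contrast in this section can be reproduced numerically. In the sketch below (helper names are ours), the huge-sample scenario uses a normal approximation to the t distribution, which is reasonable at roughly 20,000 degrees of freedom; the tiny-sample t-statistic is simply compared against 2.306, the two-tailed α = 0.05 critical value for df = 8.

```python
import math
from statistics import NormalDist

def d_and_t(m1, m2, sd, n):
    """d and t for two equal-size groups sharing one SD (a simplification)."""
    d = (m1 - m2) / sd            # pooled SD equals sd when both groups match
    t = d * math.sqrt(n / 2)      # t = d * sqrt(n1*n2/(n1+n2)) with n1 = n2 = n
    return d, t

# Scenario A: tiny effect, huge sample (n = 10,000 per group)
d_a, t_a = d_and_t(100.5, 100.0, 10.0, 10_000)
p_a = 2 * (1 - NormalDist().cdf(t_a))   # normal approx., df ~ 20,000
print(f"A: d = {d_a:.2f}, t = {t_a:.2f}, p ~ {p_a:.4f}")  # significant, trivial

# Scenario B: huge effect, tiny sample (n = 5 per group)
d_b, t_b = d_and_t(80.0, 60.0, 15.0, 5)
print(f"B: d = {d_b:.2f}, t = {t_b:.2f}")  # t stays below the 2.306 critical value
```

Running this shows d = 0.05 reaching p well under 0.001 while d = 1.33 fails to clear the significance bar, which is exactly the inversion the section describes.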

Key Points

  • p-value depends on both effect size AND sample size. Effect size is independent.
  • Large sample + tiny effect = significant but meaningless. Small sample + large effect = non-significant but important.
  • APA and medical journals now REQUIRE effect sizes alongside p-values.
  • Report both: effect size (with 95% CI) and p-value. Effect size tells you the magnitude; the p-value tells you how confidently the effect can be distinguished from chance.

4. Beyond d: Other Effect Sizes and Converting Between Them

Cohen's d is the most common effect size, but several others are used depending on the context.

**Effect sizes for different analyses**:

**1. t-tests (comparing two group means)**:
  • Cohen's d (or Hedges' g — a bias-corrected version for small samples)
  • Glass's delta (when variances are very different)

**2. Correlations**:
  • Pearson's r is itself an effect size. Interpretation: 0.1 small, 0.3 medium, 0.5 large.
  • Can be converted to d using: d = 2r / √(1 - r²)

**3. ANOVA**:
  • Eta-squared (η²): proportion of total variance explained by the factor
  • Partial eta-squared (ηₚ²): proportion of variance explained by the factor, controlling for other factors
  • Omega-squared (ω²): bias-corrected version
  • Interpretation: 0.01 small, 0.06 medium, 0.14 large
  • Cohen's f: derived from η²; interpretation 0.1 small, 0.25 medium, 0.4 large

**4. Chi-square tests and contingency tables**:
  • Phi (φ) for 2×2 tables: interpretation 0.1 small, 0.3 medium, 0.5 large
  • Cramer's V for larger tables
  • Odds ratio: common in medical research. OR = 1 is no effect; OR > 1 is a positive association; OR < 1 is a negative association.

**5. Multiple regression**:
  • Cohen's f²: f² = 0.02 small, 0.15 medium, 0.35 large
  • Standardized regression coefficients (β): each predictor's effect in SD units

**6. Logistic regression**:
  • Odds ratio for each predictor. OR = 2 means twice the odds for each 1-unit increase in the predictor.

**Conversions between common effect sizes**:
  • **d to r**: r = d / √(d² + 4)
  • **r to d**: d = 2r / √(1 - r²)
  • **d to η²**: η² = d² / (d² + (n₁ + n₂) × (n₁ + n₂ - 2) / (n₁ × n₂)). For equal n, this is approximately η² = d² / (d² + 4).
  • **r to η²**: η² = r²

Example conversions:
  • d = 0.5 → r ≈ 0.24 → η² ≈ 0.06 (all medium)
  • d = 0.8 → r ≈ 0.37 → η² ≈ 0.14 (all large)

**Small-sample bias**: Cohen's d has a slight upward bias for small samples. Hedges' g corrects this bias:

g = d × [1 - 3 / (4(n₁ + n₂) - 9)]

For large samples, d and g are essentially identical. For n < 20 per group, g is preferred.

**Reporting effect sizes in practice**:

**In the methods section**: 'Effect sizes will be reported using Cohen's d for mean comparisons and η² for ANOVA. d values of 0.2, 0.5, and 0.8 will be considered small, medium, and large effects, respectively.'

**In the results section**: 'The intervention produced a significantly greater improvement than the control condition (M_diff = 4.5, SD = 5.1; t(98) = 4.41, p < .001, d = 0.88 [95% CI 0.46, 1.30]).'

**In the discussion section**: 'The observed effect size (d = 0.88) is considered large by Cohen's (1988) conventions. This effect is comparable to that reported by previous studies of similar interventions (Smith et al., 2020, d = 0.82).'

**Common effect size mistakes**:

1. **Reporting only p-values**: doesn't tell readers how large the effect is.
2. **Reporting raw mean differences without context**: a 10-point difference on a 100-point scale is different from a 10-point difference on a 1,000-point scale. Standardize.
3. **Confusing statistical significance with effect size**: 'the result was highly significant' doesn't mean it was a large effect.
4. **Ignoring confidence intervals**: the CI around d is important. A d of 0.5 with CI [0.05, 0.95] is much less precise than a d of 0.5 with CI [0.40, 0.60].
5. **Using the wrong effect size for the analysis**: report d for t-tests, η² for ANOVA, r for correlations, OR for categorical outcomes. Match the effect size to the statistical test.

**Power analysis and effect sizes**: Effect size is a key input to power analysis. Before running a study, you specify:
  • Expected effect size (from a pilot study, meta-analysis, or theoretical minimum)
  • Desired power (usually 0.80)
  • Significance level (usually 0.05)

The power analysis calculates the required sample size. Smaller effects require larger samples. Example: detecting d = 0.5 with 80% power at α = 0.05 requires n = 64 per group (independent t-test).

StatsIQ handles all these conversions, performs power analysis, and generates complete effect size reporting with confidence intervals.
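The conversion formulas and the Hedges' g correction above translate directly into code. A sketch with our own function names (the d-to-r conversion assumes equal group sizes, as noted in the formula list):

```python
import math

def d_to_r(d):
    """Convert d to r (assumes equal group sizes)."""
    return d / math.sqrt(d**2 + 4)

def r_to_d(r):
    """Convert r back to d."""
    return 2 * r / math.sqrt(1 - r**2)

def d_to_eta_sq(d, n1, n2):
    """Convert d to eta-squared using the sample-size-aware formula."""
    return d**2 / (d**2 + (n1 + n2) * (n1 + n2 - 2) / (n1 * n2))

def hedges_g(d, n1, n2):
    """Small-sample bias correction for Cohen's d."""
    return d * (1 - 3 / (4 * (n1 + n2) - 9))

print(round(d_to_r(0.5), 2))              # 0.24 (medium in both metrics)
print(round(d_to_r(0.8), 2))              # 0.37 (large in both metrics)
print(round(d_to_eta_sq(0.5, 30, 30), 2)) # 0.06
print(round(hedges_g(0.88, 10, 10), 2))   # 0.84 -- correction visible at n = 10
```

Note that `d_to_r` and `r_to_d` are exact inverses of each other, so round-tripping a value returns it unchanged.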

Key Points

  • Choose effect size based on analysis: d for t-tests, η² for ANOVA, r for correlations, OR for chi-square.
  • Hedges' g is a bias-corrected version of d for small samples (n < 20 per group).
  • Conversions: r = d/√(d²+4). d = 2r/√(1-r²). All quantify effect magnitude.
  • Report effect sizes with 95% CI for precision. Effect size without CI loses information.

Key Takeaways

  • Cohen's d = (M₁ - M₂) / SD_pooled. Standardized effect size.
  • Interpretation thresholds: 0.2 small, 0.5 medium, 0.8 large, 1.2+ very large.
  • Effect size is INDEPENDENT of sample size (unlike p-values).
  • Always report d WITH 95% CI, alongside p-values. APA requires this.
  • Hedges' g corrects bias for small samples. For n ≥ 20 per group, d and g are interchangeable.

Practice Questions

1. A study compares a treatment and control group. Treatment (n=30): M=22, SD=5. Control (n=30): M=18, SD=4. Calculate Cohen's d and interpret.
Step 1: Calculate pooled SD. SD_pooled = √[((30-1)×5² + (30-1)×4²) / (30+30-2)] = √[(29×25 + 29×16) / 58] = √[(725 + 464) / 58] = √[1189/58] = √20.50 = 4.53. Step 2: Calculate d. d = (22 - 18) / 4.53 = 4/4.53 = 0.88. Interpretation: d = 0.88 is a large effect (approximately at the "large" threshold of 0.80). The treatment produced a mean score 0.88 standard deviations higher than the control — a substantial effect that likely has practical significance.
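To double-check the arithmetic in this answer, a minimal sketch:

```python
import math

# Treatment (n=30): M=22, SD=5; Control (n=30): M=18, SD=4
sd_pooled = math.sqrt((29 * 5**2 + 29 * 4**2) / 58)
d = (22 - 18) / sd_pooled
print(round(sd_pooled, 2), round(d, 2))  # 4.53 0.88
```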
2. A study reports p = 0.001 but d = 0.1. How do you interpret this?
The result is statistically significant (p < 0.05), but the effect size is extremely small (d = 0.1 is below even the "small" threshold of 0.2). This pattern is typical of studies with very large samples where small, practically meaningless differences achieve statistical significance because the denominator of the test statistic (standard error) is tiny. Interpretation: there IS a real difference between the groups (very unlikely to be chance), but the difference is too small to be practically meaningful. Before acting on this finding, consider: is this effect large enough to matter in the real world? For most applied decisions, d = 0.1 would not justify intervention. This is why effect sizes matter — p-value alone would have led to overinterpretation.

Study with AI

Get personalized help and instant answers anytime.

Download StatsIQ

FAQs

Common questions about this topic

**Should I report Cohen's d or Hedges' g?**

For samples of n ≥ 20 per group, d and g are essentially identical — use whichever your statistical software or journal prefers. For smaller samples, Hedges' g is preferred because it corrects the upward bias of Cohen's d. Most modern statistical packages report g by default or offer it as an option. In practice, the distinction matters mainly for meta-analyses combining small studies, where systematic bias would accumulate. For individual studies with moderate-to-large samples, either is fine — just be consistent in your reporting.

**Can StatsIQ calculate effect sizes for me?**

Yes. Snap a photo of any problem involving two-group comparisons, ANOVA, correlations, or chi-square tests and StatsIQ calculates the appropriate effect size (d, g, η², r, φ, or odds ratio), interprets the magnitude, provides the 95% confidence interval, and converts between different effect size measures when needed. It also flags when p-values and effect sizes tell conflicting stories (significant but trivial, or non-significant but meaningful).
