
Statistical Power and Sample Size: Beating Type II Error

Power is the probability of detecting a real effect. Sample size planning uses power analysis to decide how much data you need. This guide walks through the full relationship between alpha, beta, effect size, and sample size with worked examples.

What You'll Learn

  • Define power and its relationship to Type II error.
  • Identify the four quantities linked by power analysis: alpha, beta, effect size, and sample size.
  • Calculate required sample size for a t-test, ANOVA, and correlation.
  • Understand the consequences of underpowered studies.
  • Use effect size measures (Cohen's d, η², f², r) in sample size planning.

1. What Power Actually Measures

Power is the probability of correctly rejecting the null hypothesis when a true effect exists. In equation form: power = 1 − β, where β (beta) is the probability of a Type II error (failing to reject a false null).

  • Type I error (alpha, α): rejecting a true null. You conclude there is an effect when there isn't one. Convention: α = 0.05 (a 5% false positive rate).
  • Type II error (beta, β): failing to reject a false null. You miss a real effect. Convention: β = 0.20 (a 20% chance of missing a true effect), giving power = 0.80 (an 80% chance of detecting a true effect).

Why power matters: a study that lacks power is likely to produce a false negative. Running it anyway wastes time and money, and its null result is uninterpretable: you cannot conclude "no effect" from a non-significant result in an underpowered study. Power analysis is the planning step that prevents this.

The four linked quantities: sample size (n), alpha (α), effect size (d, η², f², r, etc.), and power (1 − β) are mathematically linked. Fix any three, and the fourth is determined. This is the foundation of sample size planning: specify the effect size you want to detect, the alpha you will use, and the power you want, and the math tells you how many subjects you need.
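The same linkage can be explored in code. Below is a minimal sketch in Python using the statsmodels power module (an illustrative assumption; this guide's own calculations use G*Power): fix three of the four quantities and solve for the remaining one.

```python
# Minimal sketch: the four linked quantities in a two-sample t-test design.
# Leave exactly one of effect_size / nobs1 / alpha / power unspecified and
# solve_power returns it.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Fix effect size, alpha, and power -> solve for n per group.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"required n per group: {n_per_group:.1f}")  # ~63.8, so recruit 64

# Fix n, alpha, and effect size -> solve for the power you would achieve.
achieved = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=30)
print(f"power with 30 per group: {achieved:.2f}")  # roughly 0.47
```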

Key Points

  • Power = probability of detecting a true effect = 1 − Type II error rate.
  • Convention: 80% power, 5% alpha.
  • Alpha, beta, effect size, and sample size are mathematically linked.
  • Underpowered studies cannot distinguish "no effect" from "insufficient data to detect an effect."
  • Planning sample size with power analysis prevents wasted studies.

2. Effect Size: The Critical Input to Power Analysis

Effect size quantifies the magnitude of an effect independent of sample size. A large effect is easy to detect; a small effect requires many more observations.

Common effect size measures:

Cohen's d (for comparing two means): d = (mean₁ − mean₂) / pooled SD.
  • Small: d = 0.2
  • Medium: d = 0.5
  • Large: d = 0.8
  • Example: if group means differ by 5 points and the pooled SD is 10, d = 0.5 (medium).

Eta squared (η²) for ANOVA: η² = SS_effect / SS_total, the proportion of variance explained.
  • Small: 0.01
  • Medium: 0.06
  • Large: 0.14
  • Example: if a factor explains 10% of the variance, η² = 0.10 (between medium and large).

Cohen's f² for multiple regression: f² = R² / (1 − R²), measuring unique variance explained.
  • Small: 0.02
  • Medium: 0.15
  • Large: 0.35

Correlation coefficient (r) for correlation tests:
  • Small: 0.10
  • Medium: 0.30
  • Large: 0.50

Phi or Cramér's V (for chi-square):
  • Small: 0.10
  • Medium: 0.30
  • Large: 0.50

Choosing an effect size for planning:
1. Prior similar research: use published effect sizes in your field.
2. Minimum effect of interest: the smallest effect that would be practically meaningful to detect. Often more honest than "what do I expect."
3. Cohen's conventions (small/medium/large) as defaults if no other information is available.
4. Pilot data can provide an estimated effect size, but small pilots give unreliable effect size estimates.

The critical insight: if you use Cohen's conventions, pick based on what size of effect is worth detecting. Don't plan for a large effect just because it makes the required sample small; if the true effect is smaller than your planning assumption, you will be underpowered.
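Effect sizes are simple to compute from summary statistics. A short illustrative sketch in Python (simulated data; the helper names are hypothetical) showing Cohen's d from two samples and the η²-to-f conversion used in Section 4:

```python
import numpy as np

def cohens_d(x1, x2):
    """Cohen's d for two independent samples: mean difference / pooled SD."""
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * np.var(x1, ddof=1) +
                  (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(x1) - np.mean(x2)) / np.sqrt(pooled_var)

def f_from_eta_sq(eta_sq):
    """Cohen's f from eta squared: f^2 = eta^2 / (1 - eta^2)."""
    return np.sqrt(eta_sq / (1 - eta_sq))

# Example from the text: means ~5 points apart, pooled SD ~10 -> d ~ 0.5.
rng = np.random.default_rng(42)
treatment = rng.normal(75, 10, 200)
control = rng.normal(70, 10, 200)
print(f"d = {cohens_d(treatment, control):.2f}")        # close to 0.5
print(f"f at eta^2 = 0.06: {f_from_eta_sq(0.06):.2f}")  # ~0.25 (medium)
```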

Key Points

  • Effect size is the magnitude of an effect, independent of sample size.
  • Cohen's d, η², f², r, and Cramér's V are effect sizes for different tests.
  • Cohen's conventions (small/medium/large) are defaults when no other data is available.
  • Plan for the minimum effect worth detecting, not the effect you hope for.
  • Smaller true effects require vastly more data to detect.

3. Worked Example 1: Independent-Samples t-test

Research question: does a new teaching method improve exam scores compared to standard lecture? Plan an independent-samples t-test.

Inputs:
  • Expected effect size: d = 0.5 (medium, based on prior literature)
  • Alpha: 0.05 (two-tailed)
  • Desired power: 0.80
  • Test: two-sample t-test, equal n per group

Formula for the approximate required n per group:

n per group ≈ ((z_α/2 + z_β)² × 2) / d²

With α = 0.05 (two-tailed), z_α/2 = 1.96. With power = 0.80, β = 0.20, z_β = 0.84.

n per group ≈ ((1.96 + 0.84)² × 2) / 0.5² ≈ (2.80² × 2) / 0.25 ≈ 15.68 / 0.25 ≈ 62.72

Round up: n = 63 per group, N = 126 total.

Verification with software: G*Power with inputs (t-test, two independent samples, d = 0.5, α = 0.05, power = 0.80) gives n = 64 per group. The small discrepancy comes from software using t-distribution critical values rather than the normal approximation.

What happens if the effect is smaller than planned (see the sketch below):
  • d = 0.3 instead of 0.5: required n per group = (2.80² × 2) / 0.09 ≈ 174.2, round up to 175 per group
  • d = 0.2 instead of 0.5: required n per group ≈ 393 per group

This illustrates why overestimating effect size is dangerous: you end up underpowered. If you plan for d = 0.5 and the truth is d = 0.3, your 63-per-group study has only about 38% power, meaning you are more likely to miss the effect than detect it.

Rule of thumb: when in doubt, assume a smaller effect and plan for a larger sample. Or report confidence intervals around your estimated effect, which remain interpretable regardless of whether the result reaches significance.
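A sketch of the same calculation in Python (scipy and statsmodels assumed for illustration), comparing the normal approximation above with the exact t-based answer:

```python
import math
from scipy.stats import norm
from statsmodels.stats.power import TTestIndPower

def n_per_group_approx(d, alpha=0.05, power=0.80):
    """Normal-approximation n per group for a two-sided two-sample t-test."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 at alpha = 0.05
    z_beta = norm.ppf(power)           # 0.84 at power = 0.80
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

exact = TTestIndPower()
for d in (0.5, 0.3, 0.2):
    n_approx = n_per_group_approx(d)
    n_exact = math.ceil(exact.solve_power(effect_size=d, alpha=0.05, power=0.80))
    print(f"d = {d}: approx {n_approx}/group, exact {n_exact}/group")
# The exact t-based n runs about one participant per group higher than the
# normal approximation (e.g., 64 vs 63 at d = 0.5), matching the G*Power note.
```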

Key Points

  • A two-sample t-test with d = 0.5, α = 0.05, power = 0.80 requires ~63 per group.
  • Halving the effect size roughly quadruples the required n.
  • Overestimating effect size at the planning stage leaves you underpowered in reality.
  • G*Power or similar software gives exact calculations.

4. Worked Example 2: One-Way ANOVA

Research question: does exam score differ across 4 teaching methods?

Inputs:
  • Expected effect size: f = 0.25 (medium)
  • Alpha: 0.05
  • Desired power: 0.80
  • Test: one-way ANOVA with 4 groups

Cohen's f for ANOVA is the standard deviation of the group means divided by the within-group SD. It is related to eta squared by f² = η² / (1 − η²). Medium f = 0.25 corresponds to η² ≈ 0.06.

Using G*Power (one-way ANOVA, f = 0.25, α = 0.05, power = 0.80, 4 groups): required total N ≈ 180, so ~45 per group.

As effect size varies:
  • f = 0.10 (small): total N ≈ 1100, ~275 per group
  • f = 0.25 (medium): total N ≈ 180, ~45 per group
  • f = 0.40 (large): total N ≈ 76, ~19 per group

Key observation: the number of groups also matters. More groups means more total N, though not proportionally: a 6-group ANOVA at f = 0.25 needs ~216 total (~36 per group), not the 270 you would get by keeping 45 per group across 6 groups.

Considerations:
  • Unbalanced designs: use the harmonic mean of the planned group sizes in power calculations.
  • Unequal variances: adjust for the violation or use Welch's ANOVA (slightly lower effective power).
  • Planned post-hoc pairwise comparisons lower the per-test alpha (apply a Bonferroni or Holm correction), which raises the required sample size.
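These numbers can be reproduced in Python with statsmodels' FTestAnovaPower (an illustrative alternative to G*Power, assumed available); note that solve_power here returns the required total N, not n per group:

```python
import math
from statsmodels.stats.power import FTestAnovaPower

anova = FTestAnovaPower()
for f in (0.10, 0.25, 0.40):
    # Returns the required TOTAL sample size across all k_groups.
    total_n = anova.solve_power(effect_size=f, alpha=0.05, power=0.80, k_groups=4)
    print(f"f = {f}: total N ~ {math.ceil(total_n)} "
          f"(~{math.ceil(total_n / 4)} per group)")
# The totals land near the G*Power figures quoted above (about 180 at f = 0.25).
```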

Key Points

  • A one-way ANOVA with f = 0.25, 4 groups, α = 0.05, power = 0.80 needs ~45 per group.
  • More groups require more total N.
  • Small f (0.10) requires dramatically larger samples than medium f (0.25).
  • Post-hoc tests require additional sample size planning at the corrected alpha.

5. Worked Example 3: Correlation

Research question: is hours studied correlated with exam score?

Inputs:
  • Expected effect size: r = 0.30 (medium)
  • Alpha: 0.05 (two-tailed)
  • Desired power: 0.80
  • Test: Pearson correlation

Fisher's z transformation is used for power calculations on correlations. The required N can be approximated as:

N ≈ ((z_α/2 + z_β) / Fisher_z(r))² + 3

Fisher_z(0.30) = 0.5 × ln((1 + 0.30) / (1 − 0.30)) = 0.5 × ln(1.857) = 0.5 × 0.619 = 0.310

N ≈ ((1.96 + 0.84) / 0.310)² + 3 ≈ (2.80 / 0.310)² + 3 ≈ 9.03² + 3 ≈ 81.6 + 3 ≈ 85

So detecting r = 0.30 with 80% power at α = 0.05 (two-tailed) requires approximately 85 observations.

Sensitivity:
  • r = 0.10: need ~782 observations
  • r = 0.30: need ~85 observations
  • r = 0.50: need ~30 observations

Detecting small correlations requires disproportionately more data because the Fisher z value shrinks roughly in proportion to r, and the required N grows with the inverse square of that value.

Effect size inflation warning: small samples produce noisy estimates of r. An r = 0.40 observed at n = 20 has a 95% CI of approximately (−0.05, 0.72): the population r could easily be 0 or could be 0.70. Small studies often overestimate effect sizes, and planning future studies from small-sample effect sizes frequently leads to underpowered replications.
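The Fisher z approximation is easy to script. A minimal sketch in Python (scipy assumed; the function name is illustrative):

```python
import math
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate N to detect a correlation r, two-sided, via Fisher's z."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    fisher_z = math.atanh(r)  # = 0.5 * ln((1 + r) / (1 - r))
    return math.ceil(((z_alpha + z_beta) / fisher_z) ** 2 + 3)

for r in (0.10, 0.30, 0.50):
    print(f"r = {r}: N ~ {n_for_correlation(r)}")
# r = 0.10 needs ~783, r = 0.30 needs ~85, r = 0.50 needs ~30, in line with
# the hand calculation above (tiny differences come from rounding the z values).
```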

Key Points

  • Correlation r = 0.30 requires ~85 observations at 80% power, α = 0.05.
  • Smaller correlations require dramatically more data (r = 0.10 needs ~782).
  • The Fisher z transformation is used for correlation power calculations.
  • Small samples produce unreliable effect size estimates; confidence intervals are wide.

6. Common Mistakes in Power Analysis

Mistake 1: running a post-hoc power calculation based on the observed effect size. This is statistically meaningless. If you found a non-significant result, post-hoc power calculated from your effect size will always look low; you are essentially computing the p-value again in different units. Use a priori power analysis (planning before data collection) based on expected or minimum-meaningful effect sizes, not post-hoc analysis based on observed effects.

Mistake 2: using effect sizes from small pilot studies. Small pilots produce highly variable effect size estimates. An r = 0.50 in a pilot of n = 10 might truly be r = 0.20 or r = 0.70 in the population. Don't scale your main study on a small pilot's point estimate; either use the lower confidence bound of the pilot or use theoretically expected effect sizes instead.

Mistake 3: assuming a large effect without justification. "I expect a large effect because the intervention is strong" is not justification. Unless prior similar research shows large effects, default to medium (or small for novel interventions). Overly optimistic effect size assumptions are the leading cause of underpowered studies.

Mistake 4: ignoring attrition. If 20% of participants will drop out, your planned n = 100 becomes an effective n = 80. Account for attrition: plan enough participants that the post-attrition n meets the power requirement.

Mistake 5: multiple comparisons not built into the calculation. If you plan 5 comparisons, each at α = 0.05, your family-wise error rate is much higher than 5%. Apply Bonferroni (divide alpha by the number of tests) at the planning stage, which increases the required sample size. A Bonferroni-corrected alpha of 0.01 requires roughly 50% more sample than α = 0.05 for the same power. The sketch below illustrates Mistakes 4 and 5.

Mistake 6: treating power analysis as a one-time calculation. As effect sizes, dropout rates, and design details evolve during piloting and early data collection, update your sample size plan. Many studies discover partway through that their design assumptions were wrong and need revision.

Mistake 7: running it once in software and not checking. G*Power and similar tools can give very different results depending on which specific test you select. Two-sample, one-sample, paired, and Welch's t-tests all have different power formulas. Confirm you are using the correct test option.
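Mistakes 4 and 5 are mechanical and easy to build into a planning script. A sketch (Python with statsmodels assumed; the numbers are the examples from this section):

```python
import math
from statsmodels.stats.power import TTestIndPower

def recruit_n(n_effective, dropout_rate):
    """Inflate the analyzable n to the number that must be recruited."""
    return math.ceil(n_effective / (1 - dropout_rate))

# Mistake 5: plan at the Bonferroni-corrected alpha, not the nominal one.
t = TTestIndPower()
n_nominal = math.ceil(t.solve_power(effect_size=0.5, alpha=0.05, power=0.80))
n_corrected = math.ceil(t.solve_power(effect_size=0.5, alpha=0.05 / 5, power=0.80))
print(f"per group at alpha=.05: {n_nominal}, at alpha=.01: {n_corrected}")
# roughly 64 vs 95 per group: about 50% more at the corrected alpha

# Mistake 4: recruit enough that the post-attrition n still meets the target.
print(recruit_n(100, 0.20))  # 125 recruited -> ~100 analyzable at 20% dropout
```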

Key Points

  • Never use post-hoc power analysis to justify a null result.
  • Small pilots give unreliable effect size estimates; use theoretical expectations instead.
  • Default to medium or small effect size assumptions unless strong prior data supports large.
  • Account for the dropout rate (plan for the effective post-attrition n).
  • Adjust alpha for multiple comparisons at the planning stage.
  • Re-run power analysis as the study design evolves.

Key Takeaways

  • Power = 1 − β; the conventional target is 0.80.
  • Alpha, beta, effect size, and n are mathematically linked: fix three, and the fourth is determined.
  • Cohen's conventions: d = 0.2/0.5/0.8 (small/medium/large); η² = 0.01/0.06/0.14; r = 0.10/0.30/0.50.
  • Halving the effect size roughly quadruples the required sample size.
  • Underpowered studies produce uninterpretable null results.
  • Post-hoc power analysis on observed effects is statistically meaningless.

Practice Questions

1. Two-sample t-test, d = 0.4, α = 0.05 (two-tailed), power = 0.80. Approximate required n per group?
n ≈ ((1.96 + 0.84)² × 2) / 0.4² = (2.80² × 2) / 0.16 = 15.68 / 0.16 = 98 per group.
2. If power = 0.60, what is the Type II error rate?
β = 1 − 0.60 = 0.40. There is a 40% chance of missing a true effect.
3. Why is 80% power conventional?
It balances practicality and reliability. Higher power (90%, 95%) requires much larger samples; lower power (60%) leaves too many false negatives. 80% is widely accepted as adequate for most research without being excessive in sample requirements.
4. A study reports "insufficient power" to detect an effect (p = 0.12). Can we conclude no effect exists?
No. Non-significant results in underpowered studies are uninterpretable: they could mean there is no effect, or that there is a real effect the study was too small to detect. Report the confidence interval for the effect size instead; if the CI is wide and includes meaningful values, the data are simply inconclusive.
5. Planned n = 100 with 20% expected dropout. What n should you actually recruit?
Plan for an effective post-dropout n of 100. Required recruitment = 100 / (1 − 0.20) = 100 / 0.80 = 125 participants.


FAQs

Common questions about this topic

How much power should my study have?
80% is the field convention. 90% or higher for high-stakes research (clinical trials, policy decisions) where missing true effects has serious consequences. 70% is sometimes acceptable for exploratory research with very difficult-to-obtain subjects. Below 70% is generally too low to be worth running.

How do I choose an effect size if I don't have prior data?
Options: (1) prior published research in your field, (2) meta-analytic estimates, (3) theoretical justification from the mechanism, (4) the minimum effect worth detecting (most honest), (5) Cohen's conventions as defaults. Pilot studies can inform effect size, but small pilots give unreliable estimates; use the lower bound of the pilot's confidence interval rather than the point estimate.

What is the difference between a priori and post-hoc power analysis?
A priori: computed before data collection using the expected effect size, to determine the required sample size. Meaningful and recommended. Post-hoc: computed after data collection using the observed effect size. Generally meaningless: a high observed effect gives high "post-hoc power" and a low observed effect gives low "post-hoc power," so it just tracks the p-value in different units. Don't use post-hoc power to interpret null results; use confidence intervals instead.

How does power analysis work for multiple regression?
More predictors generally require larger samples for the same power on each coefficient. G*Power has specific calculators for multiple regression that account for the number of predictors (u) and the residual degrees of freedom. Rule of thumb: 10-20 observations per predictor as a minimum; more for smaller effects.

How can I increase power without increasing sample size?
Limited options. (1) Use paired or within-subject designs where possible; paired tests are more powerful than independent tests at the same n. (2) Reduce measurement error (more reliable instruments increase effective power). (3) Control for covariates (ANCOVA reduces error variance). (4) Increase alpha (α = 0.10 instead of 0.05 gives more power but more Type I error). (5) Use a one-tailed test when justified by a strong directional hypothesis. Beyond these, n is the primary lever.

Can StatsIQ help with power analysis?
Yes. Describe your test (t-test, ANOVA, correlation, regression, chi-square), your expected effect size, alpha, and desired power. StatsIQ computes the required sample size, suggests effect size conventions if you don't have an estimate, and flags common planning mistakes. It also handles more complex designs (repeated measures, mixed models, multiple comparisons). This content is for educational purposes only and does not constitute statistical advice.
