Advanced · Intermediate · 20-25 min

A-Priori vs Post-Hoc Power Analysis: Sample-Size Planning Done Right

A-priori power analysis tells you how many subjects to collect; post-hoc power analysis is widely misused. Here is exactly how each works, what observed power actually means, and four worked sample-size calculations across t-test, ANOVA, correlation, and chi-square.

What You'll Learn

  • Compute required sample size for a chosen test, effect size, alpha, and power target.
  • Explain why post-hoc “observed power” is a deterministic function of the p-value.
  • Plan a study around a minimum effect of practical interest, not just whatever the literature reports.

1. Direct Answer: Plan A-Priori, Avoid Observed Post-Hoc

A-priori (or “prospective”) power analysis is done BEFORE collecting data. You specify the test, the smallest effect size you care about, alpha (typically 0.05), and a target power (typically 0.80). The output is the minimum sample size that gives the test at least that much chance of detecting that effect. This is the version reviewers, funders, and pre-registration platforms expect. Post-hoc power analysis comes in two flavors. Sensitivity power analysis is legitimate: given the sample size you have, what is the smallest effect you could detect at 80% power? That is a useful diagnostic. Observed (“achieved”) power, computed from the effect size actually estimated in your data, is misleading and widely discouraged. A result that lands exactly at p = α has observed power of about 50%, so p < α always yields observed power above roughly 50% and p > α yields observed power below it. Observed power is a deterministic function of the p-value, not new information. If you must report something post-hoc, report a confidence interval for the effect.

Key Points

  • A-priori: required sample size given a planning effect size, alpha, and power target.
  • Sensitivity: minimum detectable effect given your sample size.
  • Observed power from the data’s own effect size adds nothing beyond the p-value.
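The claim that observed power is just the p-value in disguise can be checked numerically. The sketch below assumes a two-sided z-test (a simplification; a t-test behaves almost identically at moderate sample sizes), and the function name `observed_power_from_p` is illustrative rather than taken from any library.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def observed_power_from_p(p_value: float, alpha: float = 0.05) -> float:
    """Observed power of a two-sided z-test, computed from the p-value alone."""
    z_obs = _STD_NORMAL.inv_cdf(1 - p_value / 2)   # |z| implied by the p-value
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)    # two-sided critical value
    # Treat the observed |z| as if it were the true standardized effect and
    # ask how often a replicate statistic would cross either critical value.
    upper = 1 - _STD_NORMAL.cdf(z_crit - z_obs)
    lower = _STD_NORMAL.cdf(-z_crit - z_obs)
    return upper + lower

for p in (0.001, 0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:<5} -> observed power = {observed_power_from_p(p):.3f}")
```

No sample size, effect size, or raw data enters the function: the p-value and alpha are the only inputs, which is why observed power cannot add information beyond them.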

2. The Four Knobs of Power

For any parametric test, power depends on four quantities, and fixing three pins down the fourth. (1) Alpha — the type I error rate; smaller alpha lowers power. (2) Sample size n — power rises as n grows. (3) Effect size — power rises with effect size; the right one depends on the test (Cohen’s d for two-sample t, f for ANOVA, w for chi-square, r for correlation). (4) Power — typically targeted at 0.80 (so type II error rate β = 0.20). Pick three, solve for the fourth. A-priori solves for n. Sensitivity solves for the effect size. Doubling your sample size does not double your power; for a t-test, sample size needs to roughly quadruple to halve the minimum detectable effect.

Key Points

  • Alpha, n, effect size, and power form a system: fix three, derive the fourth.
  • Effect-size scale differs by test; do not confuse Cohen’s d with f or w.
  • Sample size scales like 1/effect² — small effects need huge samples.
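To make the "fix three, derive the fourth" idea concrete, here is a minimal sketch for a two-sided, two-sample comparison. It uses the normal approximation (an assumption; exact calculations use the non-central t distribution) and ignores the negligible chance of rejecting in the wrong tail. The function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def power_two_sample(n_per_group, d, alpha=0.05):
    """Solve for power given n, effect size d, and alpha (normal approximation)."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(d * sqrt(n_per_group / 2) - z_crit)

def min_detectable_d(n_per_group, alpha=0.05, power=0.80):
    """Solve for effect size given n, alpha, and power (sensitivity analysis)."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    z_power = Z.inv_cdf(power)
    return (z_crit + z_power) * sqrt(2 / n_per_group)

# Doubling n does not double power.
print(round(power_two_sample(64, 0.5), 3), round(power_two_sample(128, 0.5), 3))

# Quadrupling n halves the minimum detectable effect.
print(round(min_detectable_d(50), 3), round(min_detectable_d(200), 3))
```

The second print line shows the 1/effect² scaling directly: the minimum detectable d at n = 200 per group is exactly half the value at n = 50.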

3. Worked Example 1: Independent-Samples t-Test

You want to detect a Cohen’s d = 0.5 (medium) effect with alpha = 0.05 two-sided and power = 0.80. The required sample size per group from a non-central t calculation is approximately n = 64 per group, or 128 total. Drop the effect size to d = 0.3 and required n per group jumps to about 175 (350 total). Drop to d = 0.2 and you need about 394 per group (788 total). Notice the quadratic relationship: cutting d in half quadruples the required sample. To shortcut without software: n_per_group ≈ 16 / d² for power 0.80 and alpha 0.05 two-sided. d = 0.5 → n ≈ 64; d = 0.3 → n ≈ 178; d = 0.2 → n ≈ 400. The shortcut matches the exact answer at d = 0.5 and overestimates slightly at smaller d (400 versus 394 at d = 0.2), which is conservative and useful for planning. For effects well above d = 0.5 it can undershoot by a subject or so per group, so confirm large-effect plans with software.

Key Points

  • Shortcut: n_per_group ≈ 16 / d² for α = 0.05, power = 0.80, two-sided t.
  • d = 0.5 → 64/group; d = 0.3 → 175/group; d = 0.2 → 394/group.
  • Halving the effect quadruples the required sample.
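The figures above can be approximated without power software. The sketch below uses the normal approximation plus a small additive correction for using t rather than z critical values; that correction (z²/4 per group) is a commonly used adjustment I am assuming here, not something the text specifies, and exact tools solve the non-central t directly.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample t-test (unrounded)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # about 0.842 for power = 0.80
    n_normal = 2 * (z_a + z_b) ** 2 / d ** 2    # pure normal approximation
    return n_normal + z_a ** 2 / 4              # assumed small-sample correction

for d in (0.5, 0.3, 0.2):
    approx = n_per_group(d)
    shortcut = 16 / d ** 2
    print(f"d = {d}: approx n = {approx:.1f} (round up to {ceil(approx)}), "
          f"16/d^2 = {shortcut:.0f}")
```

For d = 0.3 the unrounded value is about 175.4, which is why sources quote either 175 or 176 per group depending on whether they round up.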

4. Worked Example 2: One-Way ANOVA

You plan an ANOVA with k = 4 groups and an expected Cohen’s f = 0.25 (medium). At alpha = 0.05 and power = 0.80 the required total sample is about n = 180 (45 per group). Smaller f = 0.10 requires roughly n = 1,096 total. Cohen’s f relates to η² as f = √(η² / (1 − η²)), so f = 0.10 → η² = 0.010, f = 0.25 → η² = 0.059, f = 0.40 → η² = 0.138. Planning around η² instead of f tends to look more reasonable to subject-matter readers but the underlying calculation is identical.

Key Points

  • k = 4 groups, f = 0.25, α = 0.05, power = 0.80 → n ≈ 180 total.
  • f and η² interconvert; report the one your audience uses.
  • Adding groups raises required n; budget accordingly.
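The f and η² conversion is simple enough to script. This is a small sketch with illustrative function names; it only covers the conversion, not the non-central F calculation that produces the sample sizes.

```python
from math import sqrt

def eta2_to_f(eta2):
    """Cohen's f from eta-squared: f = sqrt(eta2 / (1 - eta2))."""
    return sqrt(eta2 / (1 - eta2))

def f_to_eta2(f):
    """Eta-squared from Cohen's f: eta2 = f^2 / (1 + f^2)."""
    return f ** 2 / (1 + f ** 2)

for f in (0.10, 0.25, 0.40):
    print(f"f = {f:.2f} -> eta^2 = {f_to_eta2(f):.3f}")
```

Converting in both directions is handy when a collaborator states the expected effect as "about 6% of variance explained" and the power tool asks for f.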

5. Worked Example 3: Correlation Test

You expect Pearson r = 0.30 (moderate) in the population. At alpha = 0.05 two-sided and power = 0.80 the required sample is about n = 84. For r = 0.20 the required sample more than doubles, to about 194; for r = 0.10 it explodes to about 783. Use Fisher’s z transformation to do the calculation by hand: z_r = 0.5 × ln[(1 + r) / (1 − r)]. The standard error of z_r is 1/√(n − 3), and the multiplier needed for power = 0.80 at alpha = 0.05 two-sided is z* = 1.96 + 0.842 = 2.802 (the two-sided critical value plus the normal quantile for 80% power). Solve for n: n = 3 + (2.802 / z_r)². For r = 0.30, z_r = 0.310, so n = 3 + (2.802 / 0.310)² = 3 + 81.7 = 84.7, which agrees with the software figure of about 84 to within one observation.

Key Points

  • Required n: 84 (r = 0.30), 194 (r = 0.20), 783 (r = 0.10).
  • Fisher’s z transform makes the calculation explicit: n = 3 + (2.802/z_r)².
  • Small correlations are vastly more expensive to detect than they look.
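The Fisher z calculation translates directly into a few lines. This is a sketch of the approximation described above, not an exact bivariate-normal computation, and the function name is illustrative. Note that `math.atanh(r)` is the same quantity as 0.5 × ln[(1 + r) / (1 − r)].

```python
from math import atanh, ceil
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate n to detect correlation r (two-sided) via Fisher's z."""
    z_star = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    z_r = atanh(r)                      # Fisher z transform of r
    return 3 + (z_star / z_r) ** 2      # unrounded; round up for planning

for r in (0.30, 0.20, 0.10):
    n = n_for_correlation(r)
    print(f"r = {r:.2f}: n = {n:.1f} (round up to {ceil(n)})")
```

With unrounded inputs the r = 0.30 case comes out a little above the hand calculation (just under 85 rather than 84.7) because z_r is closer to 0.3095 than 0.310; either way the approximation sits within one observation of the exact answer.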

6. Worked Example 4: Chi-Square Goodness-of-Fit

A chi-square test with df = 4 and Cohen’s w = 0.20 (small-to-medium) at alpha = 0.05 and power = 0.80 requires about n = 299 observations. Cohen’s w = √(Σ((p_i − p_0i)² / p_0i)), where p_0i is the null proportion in cell i and p_i is the proportion expected under the alternative. Many planning exercises start with the alternative cell proportions and compute w directly. The required n equals λ / w², where λ is the noncentrality parameter that delivers the target power at the given df. For df = 1 (two categories in a goodness-of-fit test, or a 2×2 table in a test of independence) the same w needs n ≈ 197; for df = 8, n ≈ 376. Df matters, but less than w does: going from df = 1 to df = 8 roughly doubles n, while halving w quadruples it.

Key Points

  • Chi-square df = 4, w = 0.20, α = 0.05, power = 0.80 → n ≈ 299.
  • w is computed directly from null vs alternative cell probabilities.
  • Df has a weaker effect on required n than w, but it is not negligible.
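Both steps, computing w from cell proportions and turning w into a sample size, can be sketched with the standard library alone. The code below builds the non-central chi-square distribution as a Poisson-weighted mixture of central chi-squares and finds the noncentrality λ = n·w² that reaches the target power. In practice a library routine such as `scipy.stats.ncx2` would replace the hand-rolled helpers; all function names and the example proportions here are illustrative assumptions.

```python
from math import ceil, exp, lgamma, log, sqrt

def cohens_w(null_props, alt_props):
    """Cohen's w from null and alternative cell proportions."""
    return sqrt(sum((p1 - p0) ** 2 / p0 for p0, p1 in zip(null_props, alt_props)))

def gamma_p(a, x):
    """Regularized lower incomplete gamma P(a, x) via its power series."""
    if x <= 0.0:
        return 0.0
    term = total = 1.0 / a
    k = 0
    while term > total * 1e-15:
        k += 1
        term *= x / (a + k)
        total += term
    return total * exp(a * log(x) - x - lgamma(a))

def chi2_cdf(x, df):
    """CDF of the central chi-square distribution."""
    return gamma_p(df / 2.0, x / 2.0)

def chi2_critical(alpha, df):
    """Upper-alpha critical value of the central chi-square, by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(80):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return hi

def chi2_power(lam, df, crit):
    """Power at noncentrality lam: Poisson-weighted mix of central chi-squares."""
    cdf, weight = 0.0, exp(-lam / 2.0)
    for j in range(200):
        cdf += weight * chi2_cdf(crit, df + 2 * j)
        weight *= (lam / 2.0) / (j + 1)
    return 1.0 - cdf

def required_n(w, df, alpha=0.05, power=0.80):
    """Smallest n whose noncentrality n * w**2 reaches the target power."""
    crit = chi2_critical(alpha, df)
    lo, hi = 0.0, 100.0   # search range for the noncentrality parameter
    for _ in range(80):
        mid = (lo + hi) / 2.0
        if chi2_power(mid, df, crit) < power:
            lo = mid
        else:
            hi = mid
    return ceil(hi / w ** 2)

# Hypothetical example: five equally likely categories under the null (df = 4).
w = cohens_w([0.20] * 5, [0.28, 0.18, 0.18, 0.18, 0.18])
print(round(w, 3), [required_n(w, df) for df in (1, 4, 8)])
```

The example alternative was chosen so that w works out to 0.20, which lets the printed sample sizes be compared against the figures in this section.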

7. How to Pick the Planning Effect Size

The single most consequential decision is the planning effect size, and three principles apply. First, anchor on the smallest effect of PRACTICAL interest (how large must the effect be before your audience would act on it?) rather than on whatever effect the literature happens to report. Second, when you do use prior meta-analytic estimates, discount them by 25–50% for publication-bias inflation, because published studies systematically overestimate effects. Third, run the analysis at two or three plausible effect sizes and report the corresponding sample sizes; reviewers and funders prefer a sensitivity range to a single point estimate. Avoid copying Cohen’s rules of thumb (small/medium/large) into fields where they do not apply: in clinical pharmacology a “small” d = 0.2 may be the largest realistic effect, while in psychophysics a “medium” d = 0.5 is unusually small.

Key Points

  • Anchor planning effect on practical importance, not published averages.
  • Discount meta-analytic effects 25–50% for publication-bias inflation.
  • Report sample size at multiple plausible effect sizes — a sensitivity ladder.
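A sensitivity ladder is easy to generate once the sample-size formula is in a function. The sketch below assumes a two-sided, two-sample t-test and uses the normal approximation with an assumed z²/4 correction; the published effect of d = 0.8 and the discount levels are hypothetical inputs.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample t-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return ceil(2 * (z_a + z_b) ** 2 / d ** 2 + z_a ** 2 / 4)

def sensitivity_ladder(published_d, discounts=(0.0, 0.25, 0.50)):
    """Pair each discounted planning effect with its required per-group n."""
    return [(published_d * (1 - disc), n_per_group(published_d * (1 - disc)))
            for disc in discounts]

for planning_d, n in sensitivity_ladder(0.8):
    print(f"planning d = {planning_d:.2f} -> n per group = {n}")
```

Reporting all three rows, rather than only the undiscounted one, shows a reviewer how quickly the budget grows if the published effect turns out to be inflated.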

8. Running a Power Analysis in StatsIQ

Pick the test (t, ANOVA, correlation, chi-square, regression), enter alpha, target power, and the planning effect size, and StatsIQ returns the required sample size. Flip the inputs to run sensitivity power analysis — given the n you can afford, what is the smallest effect detectable at 80% power? Observed/achieved power is deliberately not foregrounded because it is a deterministic function of the p-value and adds no information. This content is for educational purposes only.

Key Points

  • A-priori, sensitivity, and required-effect modes for every common test.
  • Reports n at multiple effect sizes by default for sensitivity tables.
  • Observed/achieved power is intentionally deprioritized as uninformative.

Key Takeaways

  • Power = 1 − β, the probability of detecting a real effect; conventional target 0.80.
  • t-test shortcut: n_per_group ≈ 16/d² for α = 0.05 two-sided, power = 0.80.
  • ANOVA: f = √(η²/(1−η²)); f = 0.10 small, 0.25 medium, 0.40 large.
  • Correlation: n ≈ 3 + (2.802 / z_r)² via Fisher’s z transformation.
  • Observed power is a deterministic function of the p-value; reviewers reject it.

Practice Questions

1. You can only afford n = 50 per group for an independent t-test. What is the smallest Cohen’s d detectable at α = 0.05 two-sided and power = 0.80?
Solve 16/d² ≈ 50 → d² ≈ 0.32 → d ≈ 0.57. So with 50 per group you can detect a medium-to-large effect at 80% power. Anything smaller risks under-powered conclusions.
2. A published study reports d = 0.80 (large) with n = 30 per group and p < 0.001. Should you plan a replication at d = 0.80?
No — published effects are systematically inflated by publication bias and small-study effects. Discount to d = 0.4–0.5 for planning, which raises required sample size from about 26 per group to 64–100 per group. Plan for the smaller realistic effect to avoid replication failure.
3. Why does a reviewer push back when you cite achieved power = 0.42 after a non-significant result?
Because observed power is a deterministic function of the p-value: a result at exactly p = α has observed power of about 0.5, so any p > α implies observed power below roughly 0.5, in both one-sided and two-sided framings. It adds nothing beyond the p-value, but it FEELS like new information, and that mismatch makes reviewers reject it. Report a confidence interval for the effect size instead.

FAQs

Common questions about this topic

A power target of 80% is the convention but not a rule. For high-stakes confirmatory work (Phase III trials, regulatory submissions) 90% power is common. For exploratory or screening work 70% may be acceptable. The right target is the level at which the cost of a missed real effect equals the cost of a larger sample. A power of 80% balances the two for most academic research; industry decisions involving large downstream commitments often justify 90%.

For non-parametric tests, scale the parametric sample size by the asymptotic relative efficiency (ARE) of the non-parametric test versus its parametric counterpart. Under normality, Wilcoxon signed-rank has ARE = 0.955 vs paired t, Mann-Whitney U has ARE = 0.955 vs two-sample t, and Kruskal-Wallis has ARE ≈ 0.95 vs one-way ANOVA. Multiply the parametric required n by 1/ARE ≈ 1.05 to get the non-parametric requirement under normality. Under heavy-tailed distributions the non-parametric test can be more efficient than its parametric counterpart, and the required n is lower.
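As a rough illustration of the ARE adjustment, the sketch below inflates a parametric sample size by 1/ARE, using 3/π (about 0.955) as the efficiency under normality. The function name and the starting n of 64 are illustrative.

```python
from math import ceil, pi

ARE_UNDER_NORMALITY = 3 / pi   # about 0.955 for Wilcoxon-type tests vs t

def nonparametric_n(parametric_n, are=ARE_UNDER_NORMALITY):
    """Inflate a parametric sample size by 1/ARE and round up."""
    return ceil(parametric_n / are)

# 64 per group for a t-test becomes 68 per group for Mann-Whitney U.
print(nonparametric_n(64))
```

This is an asymptotic rule of thumb, so for small samples or clearly non-normal data a simulation is the safer check.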

Power analysis should account for multiple comparisons when the family of tests is pre-specified. The simplest approach is to power each individual comparison at alpha / m (Bonferroni-style), where m is the number of comparisons. Less conservative procedures such as Holm-Bonferroni or BH-FDR need samples larger than uncorrected single-test planning suggests but smaller than full Bonferroni. In tools such as G*Power 3 you can approximate this by entering the adjusted alpha in place of the nominal one.
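A Bonferroni-style plan amounts to rerunning the single-test calculation at alpha / m. The sketch below does that for a two-sided, two-sample t-test using the normal approximation with an assumed z²/4 correction; the function names and the choice of d = 0.5 are illustrative.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample t-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return ceil(2 * (z_a + z_b) ** 2 / d ** 2 + z_a ** 2 / 4)

def bonferroni_n_per_group(d, m, alpha=0.05, power=0.80):
    """Per-group n when each of m comparisons is tested at alpha / m."""
    return n_per_group(d, alpha=alpha / m, power=power)

for m in (1, 3, 6):
    print(f"m = {m}: n per group = {bonferroni_n_per_group(0.5, m)}")
```

Because Holm-Bonferroni and BH-FDR are never more conservative than Bonferroni, the Bonferroni figure can serve as an upper bound when budgeting.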

G*Power 3.1 (free, GUI) is the most widely cited tool and covers t, ANOVA, regression, chi-square, and correlation. R packages pwr (basic) and Superpower (factorial designs, simulations) handle complex cases. Python has statsmodels.stats.power for common tests and pingouin for a richer API. PASS (commercial) covers obscure designs needed in pharma. Pre-register the inputs you used so reviewers can audit.

For mixed-effects and multi-level designs, simulation-based power is the standard (Monte Carlo across plausible effect sizes and correlation structures), and StatsIQ provides a simulation harness with sensible defaults. For simple t, ANOVA, correlation, and chi-square the closed-form non-central distribution is used.
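The logic of simulation-based power is the same for any design: generate data under the assumed effect, run the planned test, and count rejections. The sketch below illustrates this on a deliberately simple two-group design rather than a mixed-effects model, and it uses a normal critical value in place of the t critical value (an assumption that slightly inflates power at small n). It is not StatsIQ's harness; every name in it is illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist

def mean_and_variance(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

def simulated_power(n_per_group, d, alpha=0.05, n_sims=2000, seed=1):
    """Monte Carlo power: share of simulated studies that reject the null."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # large-sample critical value
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(d, 1.0) for _ in range(n_per_group)]
        m0, v0 = mean_and_variance(control)
        m1, v1 = mean_and_variance(treated)
        stat = (m1 - m0) / sqrt(v0 / n_per_group + v1 / n_per_group)
        if abs(stat) > z_crit:
            rejections += 1
    return rejections / n_sims

print(simulated_power(64, 0.5))   # should land near the 0.80 target
print(simulated_power(64, 0.0))   # under the null, near alpha
```

For a real multi-level design the data-generation step would draw random effects and the test step would fit the planned model, but the counting loop stays the same.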
