Type I and Type II Errors Explained: Power, Sample Size, and the Trade-Off
Understand the two kinds of mistakes in hypothesis testing, how they relate to each other, what statistical power actually means, and how sample size affects your ability to detect real effects.
What You'll Learn
- ✓ Define Type I and Type II errors and explain the consequences of each
- ✓ Explain the relationship between alpha, beta, and statistical power
- ✓ Describe how sample size, effect size, and alpha level affect power
- ✓ Perform a basic power or sample size calculation for a one-sample test
1. The Two Kinds of Mistakes
Every hypothesis test can end in one of four outcomes: correctly rejecting a false null hypothesis (good), correctly failing to reject a true null (good), rejecting a true null hypothesis (Type I error, bad), or failing to reject a false null hypothesis (Type II error, bad). Understanding these four outcomes is the foundation of everything else in this guide.

A Type I error is a false positive: you conclude there is an effect when there actually is not one. If a drug trial concludes the drug works when it actually does nothing, that is a Type I error. Patients get a useless drug, resources are wasted, and the scientific literature gets polluted with a false finding. The probability of a Type I error is alpha (α), which you set before the test. When you choose α = 0.05, you are accepting a 5% chance of rejecting the null hypothesis when it is actually true.

A Type II error is a false negative: you fail to detect a real effect. If the drug actually works but your study concludes it does not, that is a Type II error, and patients miss out on an effective treatment. The probability of a Type II error is beta (β). Unlike alpha, you do not set beta directly; it depends on sample size, effect size, and your chosen alpha level.
Key Points
- • Type I error (false positive): rejecting a true null hypothesis. Probability = alpha (α).
- • Type II error (false negative): failing to reject a false null hypothesis. Probability = beta (β).
- • Alpha is set by the researcher before testing (typically 0.05). Beta depends on study design.
- • You cannot make both errors simultaneously — you either reject or fail to reject, and the null is either true or false.
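The definition of alpha has a direct empirical check: when the null hypothesis is true, a test run at α = 0.05 should reject about 5% of the time, and every one of those rejections is a Type I error. A minimal simulation sketch (assuming NumPy and SciPy are available; the sample size of 30 and the seed are arbitrary choices, not from this lesson):

```python
# Simulate many studies where the null is TRUE (population mean really is 0)
# and count how often a t-test at alpha = 0.05 rejects anyway.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_trials = 10_000

rejections = 0
for _ in range(n_trials):
    # 30 observations from a null-true population: mean 0, sd 1
    sample = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p_value = stats.ttest_1samp(sample, popmean=0.0)
    if p_value < alpha:
        rejections += 1  # false positive: null is true but we rejected it

# The empirical Type I error rate should land very close to alpha = 0.05
print(f"Empirical Type I error rate: {rejections / n_trials:.3f}")
```

Running this yields a rejection rate of roughly 0.05, which is exactly what "accepting a 5% chance of a false positive" means in practice.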
2. Why You Cannot Eliminate Both Errors Simultaneously
Here is the fundamental trade-off: reducing the chance of one error increases the chance of the other, holding everything else constant. If you make alpha very small (say 0.001 instead of 0.05), you are being extremely conservative about false positives. But that conservatism means you need much stronger evidence to reject the null, which makes it harder to detect real effects, and that increases your Type II error rate.

Think of it like a smoke detector. Make it very sensitive (a low threshold, analogous to a high alpha) and it catches every real fire but also goes off when you burn toast: few false negatives, but many false positives. Make it less sensitive (a high threshold, analogous to a low alpha) and false alarms are rare, but so is detection of a small fire: fewer false positives, but more false negatives.

The only way to reduce both errors simultaneously is to increase sample size. A larger sample gives you more information, which means you can maintain a strict alpha level while still having enough statistical power to detect real effects. This is why sample size planning is so important in research design: it is the lever that lets you hold both error rates at acceptable levels.
Key Points
- • Decreasing alpha (stricter about Type I) increases beta (more likely to miss real effects), all else equal
- • The only way to reduce both errors simultaneously is to increase sample size
- • The trade-off between error types reflects a fundamental limitation of inference from finite data
- • Different fields resolve this trade-off differently — medical trials use strict alpha, exploratory research may tolerate higher alpha
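The trade-off can be made concrete with the normal approximation for a two-sided one-sample z-test: at a fixed sample size, shrinking alpha pushes the critical value outward, which pushes beta up. A sketch, assuming SciPy; δ = 5 and σ = 15 are borrowed from the worked example later in this guide, while n = 50 is an arbitrary illustrative choice:

```python
# Beta for a two-sided one-sample z-test (normal approximation,
# ignoring the negligible far rejection tail).
from scipy.stats import norm

def type_ii_error(alpha, delta, sigma, n):
    """Approximate P(fail to reject | true effect = delta)."""
    z_crit = norm.ppf(1 - alpha / 2)  # stricter alpha -> larger critical value
    shift = delta * n**0.5 / sigma    # how far the true effect shifts the z-statistic
    return norm.cdf(z_crit - shift)

delta, sigma, n = 5, 15, 50
for alpha in (0.05, 0.01, 0.001):
    beta = type_ii_error(alpha, delta, sigma, n)
    print(f"alpha = {alpha:<6}  beta = {beta:.3f}  power = {1 - beta:.3f}")
```

With everything else held fixed, each drop in alpha raises beta: being stricter about false positives makes you miss more real effects, exactly as the key points state.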
3. Statistical Power: Your Ability to Detect Real Effects
Power is the probability of correctly rejecting a false null hypothesis. In other words, it is the probability of detecting a real effect when one exists. Mathematically, power = 1 - β. If your Type II error rate is 0.20 (a 20% chance of missing a real effect), your power is 0.80 (an 80% chance of detecting it).

The conventional target for power is 0.80 (80%), meaning you accept a 20% chance of missing a real effect. Some fields aim for 0.90, especially in confirmatory or high-stakes studies. Anything below 0.50 means your study is more likely to miss the effect than to find it, which is a waste of resources.

Power depends on four factors: sample size (larger samples increase power), effect size (larger effects are easier to detect), alpha level (higher alpha increases power but also increases Type I error risk), and variability in the data (less variability makes effects easier to see against the noise).

Here is the practical implication: if you run a study with low power and get a non-significant result, you cannot conclude the effect does not exist. You can only say you did not detect it, which might be because it is not there, or because your study was not powerful enough to find it. This distinction between absence of evidence and evidence of absence is one of the most important concepts in applied statistics.
Key Points
- • Power = 1 - β = probability of detecting a real effect when one exists
- • The conventional minimum power target is 0.80 (80%)
- • Power increases with larger sample size, larger effect size, higher alpha, and lower variability
- • A non-significant result from a low-powered study does not prove the effect is absent — it may just be undetectable with that sample
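Because power = 1 - β, the same normal approximation used for beta gives power directly, and you can watch it climb with sample size. A sketch for a two-sided one-sample z-test, assuming SciPy; δ = 5 and σ = 15 match the worked example in the next section, and the n values are arbitrary:

```python
# Power of a two-sided one-sample z-test under the normal approximation:
# power = 1 - Phi(z_{alpha/2} - delta * sqrt(n) / sigma)
# (the far rejection tail contributes negligibly and is ignored).
from scipy.stats import norm

def power(n, delta, sigma, alpha=0.05):
    z_crit = norm.ppf(1 - alpha / 2)
    shift = delta * n**0.5 / sigma  # how far the alternative shifts the z-statistic
    return 1 - norm.cdf(z_crit - shift)

for n in (20, 40, 71, 100):
    print(f"n = {n:>3}  power = {power(n, 5, 15):.2f}")
# Power crosses the conventional 0.80 target right around n = 71.
```

Note how the first factor from the list above (sample size) moves power from "more likely to miss the effect than find it" to the conventional 0.80 target without touching alpha at all.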
4. Sample Size and Power: A Worked Example
Suppose you want to test whether a tutoring program raises exam scores. You expect the effect to be about 5 points on a 100-point exam, and historical data shows the standard deviation of exam scores is 15 points. You want power = 0.80 at alpha = 0.05 using a two-sided one-sample z-test.

The sample size formula for a one-sample z-test is n = ((z_alpha/2 + z_beta) × σ / δ)², where z_alpha/2 = 1.96 (for α = 0.05, two-sided), z_beta = 0.84 (for power = 0.80), σ = 15 (the standard deviation), and δ = 5 (the expected effect size). Plugging in: n = ((1.96 + 0.84) × 15 / 5)² = (2.80 × 3)² = 8.4² = 70.56, so you need at least 71 students.

Now see what happens when you change the inputs. If the expected effect is only 3 points instead of 5: n = (2.80 × 15 / 3)² = (2.80 × 5)² = 14² = 196. Smaller effects require dramatically more data to detect. If you increase power to 0.90: z_beta becomes 1.28, so n = ((1.96 + 1.28) × 15 / 5)² = (3.24 × 3)² = 9.72² = 94.5, requiring 95 students.

The general lesson: required sample size grows with the square of the inverse effect size, so cutting the detectable effect size in half roughly quadruples the required sample. This is why researchers must be realistic about the smallest effect worth detecting; chasing tiny effects requires enormous samples. StatsIQ has a built-in power calculator that lets you explore these trade-offs interactively.
Key Points
- • Sample size for a z-test: n = ((z_alpha/2 + z_beta) × σ / δ)²
- • Halving the detectable effect size roughly quadruples the required sample — sample size grows quadratically
- • Always calculate required sample size before collecting data, not after
- • Under-powered studies waste resources — they are unlikely to detect real effects and contribute ambiguous results
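The formula above is easy to wrap in a function and check against the worked numbers. One caveat this exposes: the worked example rounds the z-values to 1.96 and 0.84, which gives n = 196 for the 3-point effect; with exact quantiles the result is 196.2, which rounds up to 197. A sketch assuming SciPy:

```python
# Required sample size for a two-sided one-sample z-test:
# n = ((z_{alpha/2} + z_beta) * sigma / delta)^2, rounded up to a whole subject.
import math
from scipy.stats import norm

def required_n(delta, sigma, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.9600 for alpha = 0.05, two-sided
    z_beta = norm.ppf(power)           # 0.8416 for power = 0.80
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

print(required_n(delta=5, sigma=15))              # 71, as in the worked example
print(required_n(delta=3, sigma=15))              # 197 (196 with rounded z-values)
print(required_n(delta=5, sigma=15, power=0.90))  # 95, as in the worked example
```

A function like this makes the quadratic growth easy to explore: try halving `delta` and watch the required n roughly quadruple.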
5. Real-World Consequences and How Different Fields Handle Errors
The consequences of each error type vary dramatically by context, and different fields have adapted their alpha levels accordingly.

In criminal justice, the system is designed to minimize Type I errors (convicting an innocent person). The standard of proof is "beyond reasonable doubt", which corresponds to a very low alpha. The trade-off is more Type II errors (guilty people going free), which the system accepts because wrongful conviction is considered worse than wrongful acquittal.

In medical screening, the priority is reversed. A cancer screening test should minimize Type II errors (missing a real cancer). High sensitivity means more false positives (Type I errors), which lead to unnecessary follow-up tests; these are costly and stressful, but far less harmful than missing a treatable cancer.

In particle physics, the standard for claiming a discovery is 5 sigma (roughly alpha = 0.0000003). This extreme conservatism reflects the field's experience with false discoveries and the difficulty of replication. In contrast, social science research has historically used alpha = 0.05, though there is a growing movement toward stricter thresholds after the replication crisis revealed that many published findings were likely false positives from under-powered studies.

The takeaway: alpha = 0.05 is a convention, not a law of nature. The right error trade-off depends on the consequences of each type of mistake in your specific context.
Key Points
- • Different fields choose different alpha levels based on the relative costs of Type I and Type II errors
- • Criminal justice prioritizes avoiding false convictions (low alpha). Medical screening prioritizes avoiding missed diagnoses (low beta).
- • Particle physics uses 5-sigma (alpha ≈ 3 × 10⁻⁷). Social science uses alpha = 0.05 but is moving toward stricter standards.
- • Always consider the practical consequences of both error types before choosing your significance level
Key Takeaways
- ★ Type I = false positive (alpha). Type II = false negative (beta). Power = 1 - beta.
- ★ Reducing alpha increases beta unless you also increase sample size
- ★ Power depends on four factors: sample size, effect size, alpha, and variability
- ★ Sample size grows quadratically with the inverse of effect size — detecting half the effect requires four times the sample
- ★ A non-significant result from an under-powered study is inconclusive, not evidence of no effect
Practice Questions
1. A study has alpha = 0.05 and power = 0.90. What is the probability of a Type II error?
2. A researcher wants to detect a 10-point difference (σ = 20) with 80% power at alpha = 0.05 (two-sided). What sample size is needed?
FAQs
Common questions about this topic
Which type of error is worse?
It depends entirely on the context. In medical testing, a Type II error (missing a disease) can be fatal. In drug approval, a Type I error (approving an ineffective drug) wastes resources and may cause side effects without benefit. There is no universal answer — you must consider the consequences of each error in your specific application.
Can I practice power and sample size problems in StatsIQ?
Yes. StatsIQ includes an interactive power calculator and generates problems that ask you to determine required sample sizes, identify under-powered studies, and reason about the trade-offs between alpha, beta, effect size, and sample size.