Type I vs Type II Errors: Worked Examples and Tradeoffs
A focused walkthrough of Type I and Type II errors in hypothesis testing: definitions, probability notation, the alpha/beta tradeoff, statistical power, and four worked examples in different fields (medical screening, A/B testing, manufacturing QA, criminal trials).
What You'll Learn
- Define Type I (alpha) and Type II (beta) errors in hypothesis testing
- Explain the inverse tradeoff between alpha and beta
- Define statistical power as 1 − beta
- Apply error analysis to medical, business, and quality control contexts
- Recognize when each error type is more costly
1. Defining the Two Error Types
In any hypothesis test, two errors are possible. A Type I error is rejecting the null hypothesis when it is actually true: a false positive. A Type II error is failing to reject the null hypothesis when the alternative is actually true: a false negative.

The two errors are NOT symmetric. The Type I error rate (alpha) is set by the researcher in advance; alpha = 0.05 means a 5% chance of falsely rejecting H0. The Type II error rate (beta) depends on the actual effect size, the sample size, and the chosen alpha; it is computed, not set directly.

Table of outcomes:

| Reality | Decision: Reject H0 | Decision: Fail to Reject H0 |
|---|---|---|
| H0 True | Type I Error (alpha) | Correct (1 − alpha) |
| H0 False | Correct (power = 1 − beta) | Type II Error (beta) |

The four cells exhaust all possibilities. Two are correct decisions (the upper-right and lower-left cells); two are errors. The challenge is balancing the cost of each error type given the context.

Which error is "worse" depends entirely on context. In medical screening for a treatable cancer, missing the disease (Type II) is far worse than a false alarm (Type I), which can be resolved with confirmatory testing. In a courtroom, convicting an innocent person (Type I, if H0 = innocent) is considered worse than letting a guilty person go free (Type II). The researcher must understand the cost asymmetry to choose alpha and design sample sizes appropriately.
Key Points
- Type I = false positive = reject true H0 (alpha)
- Type II = false negative = fail to reject false H0 (beta)
- Alpha is researcher-chosen (typically 0.05)
- Beta depends on effect size, sample size, alpha
- Power = 1 − beta = probability of correctly detecting an effect
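The outcome table can be made concrete with a small numerical sketch. The setup below is hypothetical, not from the article: a one-sided z-test of H0: mu = 0 against a true mean of 0.5, with known sigma = 1 and n = 25.

```python
# Hypothetical setup: one-sided z-test, H0: mu = 0, true mu = 0.5,
# known sigma = 1, n = 25, alpha = 0.05.
import math
from scipy.stats import norm

mu0, mu1, sigma, n, alpha = 0.0, 0.5, 1.0, 25, 0.05
se = sigma / math.sqrt(n)  # standard error of the sample mean

# Critical value: reject H0 when the sample mean exceeds this cutoff.
cutoff = norm.ppf(1 - alpha, loc=mu0, scale=se)

# Beta: probability the sample mean stays below the cutoff when H1 is true.
beta = norm.cdf(cutoff, loc=mu1, scale=se)
power = 1 - beta  # chance of correctly rejecting the false H0
```

With these numbers the cutoff lands near 0.33, beta near 0.20, and power near 0.80, so the four cells of the table carry probabilities alpha = 0.05 and 1 − alpha = 0.95 when H0 is true, and beta ≈ 0.20 and power ≈ 0.80 when H1 is true.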
2. The Alpha-Beta Tradeoff
Alpha and beta are inversely related when sample size and effect size are fixed. A lower alpha (stricter rejection criterion) raises beta, making it harder to detect a real effect; a higher alpha (laxer criterion) lowers beta, making detection easier.

Visually, this is the overlap between the null distribution (centered at H0) and the alternative distribution (centered at the true effect). The critical value separates "reject" from "fail to reject." Moving the critical value toward the alternative (more conservative) reduces Type I error but increases Type II error; moving it the other way does the opposite.

The only way to reduce BOTH error types simultaneously is to increase the sample size (which sharpens both sampling distributions) or increase the true effect size (which moves the alternative distribution further from the null). These are the only two levers. Statistical power analyses use this principle: given a target effect size and desired alpha and power, compute the required sample size.

Common defaults: alpha = 0.05 is standard in most fields, and power = 0.80 is a common minimum target, meaning an 80% probability of detecting a real effect of the specified size. Achieving alpha = 0.05 and power = 0.80 for a moderate effect size typically requires sample sizes in the dozens to hundreds, depending on the test.
Key Points
- Alpha and beta are inversely related (with fixed n and effect size)
- Lower alpha → higher beta, and vice versa
- Only way to reduce both: increase sample size or true effect size
- Standard defaults: alpha = 0.05, power = 0.80
- Power analysis: compute required n given target alpha and power
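The inverse relationship can be checked numerically by sweeping alpha in the same kind of one-sided z-test sketch (hypothetical numbers again: null mean 0, true mean 0.5, sigma = 1, n = 25):

```python
# Sketch of the alpha-beta tradeoff: hypothetical one-sided z-test,
# H0: mu = 0 vs true mu = 0.5, sigma = 1, n = 25.
import math
from scipy.stats import norm

mu0, mu1 = 0.0, 0.5
se = 1.0 / math.sqrt(25)

def beta_for(alpha):
    # Stricter alpha pushes the cutoff higher (more conservative)...
    cutoff = norm.ppf(1 - alpha, loc=mu0, scale=se)
    # ...which raises the miss rate under the alternative.
    return norm.cdf(cutoff, loc=mu1, scale=se)

betas = {a: beta_for(a) for a in (0.10, 0.05, 0.01)}
# Lower alpha (stricter criterion) yields higher beta.
assert betas[0.01] > betas[0.05] > betas[0.10]
```

Increasing n in this sketch shrinks `se`, which lowers every beta in the sweep at once; that is the "only two levers" point in action.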
3. Four Worked Examples
Example 1: Medical Screening. A new blood test for a treatable cancer. H0: the patient does not have cancer. H1: the patient has cancer. Type I error (false positive): the patient is told they may have cancer; anxiety follows until confirmatory testing reveals the truth. Cost: anxiety plus the cost of the confirmatory test. Type II error (false negative): the patient is told they are clear but has undetected cancer. Cost: delayed treatment and potential progression of the disease. Type II is far more costly, so screening tests are designed with high sensitivity (low beta) at the cost of moderate specificity (higher alpha); the follow-up confirmatory test reduces the cost of false positives.

Example 2: A/B Testing. A new checkout page vs. the control. H0: the new page has no impact on conversion. H1: the new page changes conversion. Type I error (false positive): launching a page that is no better, based on noisy data. Cost: potential revenue loss and time to detect the mistake. Type II error (false negative): failing to detect a real improvement. Cost: missed revenue uplift and a slower experimentation cadence. Many teams treat these as roughly symmetric (alpha = 0.05, power = 0.80); high-traffic sites can afford the larger sample sizes that reduce both error rates.

Example 3: Manufacturing QA. A quality test for a batch of components. H0: the batch meets specifications. H1: the batch fails specifications. Type I error (false positive): a good batch is rejected and scrapped or reworked unnecessarily. Cost: production loss plus rework. Type II error (false negative): a bad batch is accepted and shipped to customers. Cost: warranty claims, brand damage, and potential safety issues. Type II is usually much more costly in safety-critical industries (aerospace, medical devices), so QA processes are designed for a very low beta (bad batches rarely slip through), even though this means a higher alpha (more good batches falsely rejected).

Example 4: Criminal Trials. H0: the defendant is innocent. H1: the defendant is guilty. Type I error (false positive): convicting an innocent person. Type II error (false negative): acquitting a guilty person. The "beyond reasonable doubt" standard sets a very low alpha (perhaps 0.01, conceptually). This raises beta: some guilty people are acquitted. The justice system explicitly accepts this asymmetry because the cost of a Type I error (wrongful imprisonment) is considered much higher than that of a Type II error (a guilty person freed). This is the most explicit institutional encoding of the alpha-beta tradeoff.
Key Points
- Medical screening: Type II usually costlier (missed diagnosis)
- A/B testing: roughly symmetric, depends on team risk tolerance
- Manufacturing QA: Type II usually costlier in safety-critical fields
- Criminal trials: Type I much costlier (wrongful conviction)
- Context determines which error is "worse"
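As a rough sketch of how cost asymmetry drives design choices, the snippet below compares expected error cost per person for two hypothetical screening designs. The prevalence and cost figures are invented purely for illustration; they are not from the article.

```python
# Back-of-envelope expected-cost comparison for the screening example.
# All numbers (prevalence, costs, error rates) are hypothetical.

def expected_cost(alpha, beta, prevalence, cost_fp, cost_fn):
    """Expected error cost per person screened."""
    return (1 - prevalence) * alpha * cost_fp + prevalence * beta * cost_fn

# High-sensitivity design (low beta, tolerating a higher alpha)
# vs. a stricter design (low alpha, higher beta).
sensitive = expected_cost(alpha=0.10, beta=0.05, prevalence=0.01,
                          cost_fp=500, cost_fn=100_000)
strict = expected_cost(alpha=0.01, beta=0.30, prevalence=0.01,
                       cost_fp=500, cost_fn=100_000)

# When false negatives are far costlier, the high-sensitivity design
# has the lower expected cost despite its many more false alarms.
assert sensitive < strict
```

Swapping in manufacturing- or courtroom-style cost ratios flips or reinforces the conclusion, which is exactly the point of the four examples: the same formula, different cost asymmetries.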
4. Power and Sample Size Calculations
Power is the probability of correctly rejecting a false null hypothesis: power = 1 − beta. A test with 80% power has an 80% chance of detecting a real effect of the specified size at the specified alpha level.

Four drivers of power: (1) Effect size: larger true effects are easier to detect. (2) Sample size: larger samples sharpen both sampling distributions, improving power. (3) Alpha: a higher alpha (more liberal rejection criterion) increases power but raises the Type I error rate. (4) Variability: lower population variance increases power; higher variance reduces it.

Sample size calculation. For a two-sample t-test detecting a medium effect (d = 0.5) with alpha = 0.05 and power = 0.80, the required sample size is approximately 64 per group. For a small effect (d = 0.2), it is approximately 393 per group; for a large effect (d = 0.8), only about 26 per group. These rules of thumb come from standard power analysis tables; software (G*Power, R, Python statsmodels) computes exact values for the various tests.

Underpowered studies are a major source of misleading results in published research. A study with 30% power has only a 30% chance of detecting the true effect, and when it does reach significance, the published estimate is likely inflated, because only the larger-than-true sample estimates clear the significance threshold. This is the "winner's curse" in research. Pre-registration and power calculation are standard mitigations.
Key Points
- Power = 1 − beta = probability of correctly detecting an effect
- Four drivers: effect size, sample size, alpha, variability
- Standard target: 80% power
- For d = 0.5, alpha = 0.05, power = 0.80: ~64 per group
- Underpowered studies inflate effect estimates ("winner's curse")
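The ~64 / ~393 / ~26 per-group figures can be roughly reproduced with the standard normal-approximation formula n = 2·((z₁₋α/₂ + z₁₋β)/d)²; the approximation comes out one or two below the exact t-based table values (63 rather than 64 for d = 0.5).

```python
# Normal-approximation sample size per group for a two-sided,
# two-sample t-test. Exact t-based answers run slightly higher
# (e.g. ~64 rather than 63 for d = 0.5, as in the article's tables).
import math
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Required n per group to detect standardized effect d."""
    z_a = norm.ppf(1 - alpha / 2)  # two-sided rejection threshold
    z_b = norm.ppf(power)          # quantile for the target power
    return math.ceil(2 * ((z_a + z_b) / d) ** 2)
```

For exact t-based values, statsmodels' `TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)` is the usual tool; the formula above is just the textbook approximation behind it.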
5. How StatsIQ Helps With Error Analysis
Snap a photo of any hypothesis test result or study design and StatsIQ computes the Type I and Type II error rates, identifies the alpha-beta tradeoff for the chosen test, and computes statistical power. For study design, StatsIQ runs power analyses given target alpha and power, computing the required sample size for any common test. For interpreting published results, StatsIQ identifies when reported effect sizes are likely inflated due to low power (the "winners curse" pattern). This content is for educational purposes only.
Key Points
- Computes Type I and Type II error rates
- Identifies alpha-beta tradeoff
- Runs power analyses for study design
- Computes required sample size given alpha and power
- Flags inflated effect estimates from underpowered studies
Key Takeaways
- Type I = false positive = reject true H0 (probability = alpha)
- Type II = false negative = fail to reject false H0 (probability = beta)
- Power = 1 − beta = probability of correctly detecting an effect
- Alpha is researcher-chosen; beta depends on effect size, n, alpha
- Standard defaults: alpha = 0.05, power = 0.80
- Inverse tradeoff: lower alpha → higher beta
- Only way to reduce both: increase n or true effect size
- Medical screening: Type II usually costlier (missed disease)
- Manufacturing QA: Type II usually costlier (defective product shipped)
- Criminal trials: Type I much costlier (wrongful conviction)
- For d = 0.5, alpha = 0.05, power = 0.80: ~64 per group
- Underpowered studies inflate effect estimates ("winner's curse")
Practice Questions
1. A study has alpha = 0.05 and beta = 0.20. What is statistical power?
2. In a medical test for a fatal disease where early treatment is effective, which error type is more costly?
3. A researcher reduces alpha from 0.05 to 0.01 to be more conservative. What happens to beta (assuming everything else constant)?
4. Why are underpowered studies considered problematic even when they find statistically significant results?
5. A test has alpha = 0.05 and runs at 60% power. What is the probability of correctly NOT rejecting a true H0?
FAQs
Common questions about this topic
Why are the two error rates called alpha and beta?
The convention dates to early 20th-century statistics, particularly the work of Jerzy Neyman and Egon Pearson in the 1930s. They formalized the framework in which the Type I error rate is set in advance (alpha) and the Type II error rate is computed given the design (beta). The Greek-letter convention persists across statistics textbooks. Power, the complement of beta, has no standard Greek letter and is simply called "power" or "1 − beta."
Can the Type I error rate be driven to zero?
Only by never rejecting H0, which gives a 100% Type II error rate whenever H0 is false. Any decision rule that ever rejects H0 has a non-zero chance of doing so when H0 is true. Setting alpha = 0 is equivalent to "never declare a significant result," which makes hypothesis testing useless. The principle: you cannot drive Type I error to zero without abandoning the ability to detect real effects.
What happens when many tests are run on the same data?
Running many tests inflates the family-wise Type I error rate. With alpha = 0.05, running 20 independent tests on the same data yields approximately a 64% chance of at least one false positive. Corrections like Bonferroni (divide alpha by the number of tests) or Benjamini-Hochberg (control the false discovery rate) address this, though power can suffer dramatically under them; pre-specifying a single primary hypothesis is the standard mitigation. In A/B testing with many metrics, ignoring multiple testing inflates apparent wins.
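The arithmetic in this answer is easy to verify directly:

```python
# Family-wise false-positive rate for 20 independent tests at alpha = 0.05,
# before and after a Bonferroni correction.
alpha, m = 0.05, 20

# P(at least one false positive) = 1 - P(no false positives in m tests).
fwer = 1 - (1 - alpha) ** m

# Bonferroni: run each test at alpha / m instead.
bonferroni_alpha = alpha / m
fwer_corrected = 1 - (1 - bonferroni_alpha) ** m

print(round(fwer, 2))  # ~0.64, matching the figure quoted above
```

The corrected family-wise rate comes back just under the original 0.05, which is exactly what Bonferroni guarantees; the price, as the answer notes, is reduced power for each individual test.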
How does Bayesian analysis handle these errors?
Bayesian analysis reframes the question. Instead of rejecting or failing to reject H0, Bayesians report posterior probabilities of competing hypotheses, so the frequentist Type I/Type II framework does not directly apply. However, Bayesian decision-theoretic frameworks can incorporate similar concepts via loss functions: the cost of choosing the wrong hypothesis given the actual state. Loss functions for "treat as different when actually the same" and "treat as the same when actually different" correspond conceptually to Type I and Type II costs.
How does effect size relate to Type II error?
The relationship is inverse and strong. Larger true effects are easier to detect, so the Type II error rate (beta) decreases as effect size grows. For fixed alpha and target beta, halving the effect size approximately quadruples the required sample size. This is why effect-size estimation is critical before designing a study: assuming a larger effect than actually exists leads to an underpowered study.
How does StatsIQ help with error analysis?
Snap a photo of any hypothesis test or study design and StatsIQ computes the Type I and Type II error rates, identifies the alpha-beta tradeoff, and computes statistical power. For study design, StatsIQ runs power analyses given target alpha and power, computing the required sample size for any common test. The app also flags when reported effect sizes are likely inflated due to low power. This content is for educational purposes only.