Hypothesis Testing: The Complete Guide With 6 Worked Tests
A pillar guide to statistical hypothesis testing covering the seven-step framework, all major tests (z-test, t-test, paired and two-sample t-tests, ANOVA, chi-square, Mann-Whitney), p-values, type I and type II errors, statistical power, and a decision tree for choosing the right test. Includes worked examples for each test family and a power-and-sample-size lookup table.
What You'll Learn
- ✓Apply the seven-step hypothesis testing framework to any test scenario
- ✓Choose the correct test (z, t, paired t, ANOVA, chi-square, non-parametric) for the question and data
- ✓Distinguish one-tailed from two-tailed tests and the implications for p-values
- ✓Compute and interpret p-values, critical values, and confidence intervals consistently
- ✓Identify type I and type II errors and the trade-off controlled by alpha and power
- ✓Compute required sample size for a target power level using a published power table
1. Direct Answer: What Hypothesis Testing Is
Hypothesis testing is a statistical decision procedure for using sample data to evaluate two competing claims about a population: a null hypothesis (typically "no effect" or "no difference") and an alternative hypothesis (the research claim). The procedure computes a test statistic from the sample, compares it against a reference distribution under the null, and produces a p-value — the probability of observing a test statistic at least as extreme as the one observed if the null hypothesis were true. If the p-value is below a pre-specified significance level (alpha, typically 0.05), the null is rejected in favor of the alternative; otherwise the null is not rejected. Hypothesis testing does NOT prove the alternative is true. It only quantifies how surprising the observed data would be under the null. Modern practice pairs hypothesis testing with effect-size reporting and confidence intervals because a statistically significant result with negligible effect size is not practically meaningful.
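To make the definition concrete, here is a small simulation sketch (not part of the original example; the sample size, mean, and SD are borrowed from the worked example in section 3). Generate many samples in a world where H0 is true and measure how often the test statistic comes out at least as extreme as the observed one; that relative frequency approximates the p-value.

```python
# Empirically approximating a p-value: the fraction of test statistics
# generated in a null-true world that are at least as extreme as the
# observed one.
import numpy as np

rng = np.random.default_rng(42)
n, mu_0, sigma = 25, 100.0, 5.0   # a null world where mu really is 100
t_observed = 2.0                  # observed statistic (worked example, section 3)

samples = rng.normal(mu_0, sigma, size=(100_000, n))  # data generated under H0
means = samples.mean(axis=1)
sds = samples.std(axis=1, ddof=1)
t_null = (means - mu_0) / (sds / np.sqrt(n))

p_empirical = np.mean(np.abs(t_null) >= t_observed)   # two-tailed
print(f"empirical p = {p_empirical:.3f}")             # close to the exact ~0.057
```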
Key Points
- •Two competing claims: null hypothesis (H0, no effect) and alternative hypothesis (H1, the claim)
- •The procedure quantifies how surprising the data is under H0 (the p-value)
- •Reject H0 when p-value < alpha (typically 0.05); otherwise do not reject
- •"Failing to reject H0" is not the same as "accepting H0" — absence of evidence is not evidence of absence
- •Modern practice always reports effect size and confidence intervals alongside p-values
2. The Seven-Step Framework
Every hypothesis test, regardless of test family, follows the same seven-step framework. Memorize this sequence and the workflow becomes automatic.
1. State the null and alternative hypotheses (H0 and H1) symbolically (e.g., H0: μ = 100 vs H1: μ ≠ 100).
2. Choose the significance level (alpha) — typically 0.05, sometimes 0.01 or 0.10 depending on the field and the cost of type I errors.
3. Select the appropriate test based on the data type, sample size, and assumption checks (covered in the decision tree below).
4. Compute the test statistic from the sample data using the test's formula.
5. Determine the p-value (or critical value) from the reference distribution.
6. Compare the p-value to alpha and make a reject / fail-to-reject decision.
7. State the conclusion in plain language tied to the original research question.
A common mistake is collapsing step 1 into the conclusion — writing the alternative as "the new drug is better" instead of "μ_new > μ_old". Symbolic statements force precision and prevent ambiguous "improvement" claims that the data may not support.
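A minimal sketch of the full sequence in Python, assuming scipy is available; the sample values are invented purely for illustration, and `ttest_1samp` covers steps 4 and 5:

```python
# The seven-step framework end to end, using scipy for steps 4-5.
# The sample data below is invented purely for illustration.
import numpy as np
from scipy import stats

# Step 1: H0: mu = 100 vs H1: mu != 100 (two-tailed), stated symbolically
mu_0 = 100.0
# Step 2: significance level chosen before seeing the data
alpha = 0.05
# Step 3: one-sample t-test (sigma unknown, sample roughly normal)
sample = np.array([101.2, 99.8, 103.1, 100.5, 98.9, 102.4, 101.7, 100.1])
# Steps 4-5: test statistic and p-value from the t reference distribution
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu_0)
# Step 6: binary decision against alpha
decision = "reject H0" if p_value < alpha else "fail to reject H0"
# Step 7: conclusion in plain language
print(f"t = {t_stat:.3f}, p = {p_value:.4f} -> {decision}")
```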
Key Points
- •Always state hypotheses symbolically before collecting or analyzing data
- •Alpha is set BEFORE seeing the data — never adjusted to make a result significant
- •Reject decision is binary; the p-value provides the evidence quantification
- •Final conclusion ties back to the original research question in plain language
- •P-hacking (running tests until significance is found) violates the framework
3. Test Family 1: One-Sample z-test and t-test (Worked Example)
Use a one-sample test when comparing a sample mean to a known or hypothesized population value. Use the z-test when the population standard deviation (σ) is known (rare in practice); use the t-test when σ is unknown and must be estimated from the sample (the common case).

Formula: z or t = (x̄ − μ_0) / (s / √n). For the t-test, degrees of freedom = n − 1.

Worked Example. A factory claims its production process produces parts averaging 100 mm in length. A sample of n = 25 parts has mean x̄ = 102 mm and standard deviation s = 5 mm. Test at alpha = 0.05 whether the process mean differs from 100 mm.
1. H0: μ = 100; H1: μ ≠ 100 (two-tailed).
2. alpha = 0.05.
3. One-sample t-test (σ unknown, sample is roughly normal).
4. t = (102 − 100) / (5 / √25) = 2 / 1 = 2.0. df = 24.
5. From the t-distribution, the two-tailed p-value for t = 2.0 with df = 24 is approximately 0.057.
6. p-value (0.057) > alpha (0.05) — fail to reject H0.
7. Conclusion: At alpha = 0.05, we do not have sufficient evidence to conclude the process mean differs from 100 mm. The result is borderline; collecting more data or testing at alpha = 0.10 would change the conclusion.

Borderline results like this one demonstrate why effect size and confidence intervals matter. The 95% CI for μ is x̄ ± t_critical × (s/√n) = 102 ± 2.064 × 1 = (99.94, 104.06) — it narrowly includes 100, consistent with the fail-to-reject decision but suggesting the true mean is most likely above 100.
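The same numbers can be reproduced from the summary statistics alone; a sketch using scipy's t distribution:

```python
# One-sample t-test and 95% CI computed from summary statistics alone.
import math
from scipy import stats

n, xbar, s, mu_0 = 25, 102.0, 5.0, 100.0
se = s / math.sqrt(n)                          # standard error = 1.0
t = (xbar - mu_0) / se                         # 2.0
df = n - 1                                     # 24
p = 2 * stats.t.sf(abs(t), df)                 # two-tailed p = 0.057
t_crit = stats.t.ppf(0.975, df)                # critical value = 2.064
ci = (xbar - t_crit * se, xbar + t_crit * se)  # (99.94, 104.06)
print(f"t = {t:.2f}, df = {df}, p = {p:.4f}, "
      f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```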
Key Points
- •One-sample test compares a sample mean to a known/hypothesized population value
- •Use z when σ is known (rare); use t when σ is estimated from the sample
- •Test statistic standardizes the deviation of x̄ from μ_0 by the standard error
- •Borderline p-values near alpha should always trigger effect-size and CI examination
- •Confidence interval and hypothesis test give consistent decisions when alpha = 1 − confidence level
4. Test Family 2: Two-Sample t-test (Worked Example)
Use a two-sample (independent samples) t-test to compare means from two unrelated groups. Variants exist for equal vs unequal variances; modern practice uses Welch's t-test (unequal variances) by default because it is robust to variance violations and reduces to Student's t-test when variances are equal.

Formula (Welch): t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂).

Worked Example. Two teaching methods are compared on student test scores. Method A: n₁ = 20, x̄₁ = 78, s₁ = 8. Method B: n₂ = 25, x̄₂ = 73, s₂ = 10. Test at alpha = 0.05 whether the methods differ.
1. H0: μ_A = μ_B; H1: μ_A ≠ μ_B (two-tailed).
2. alpha = 0.05.
3. Welch's two-sample t-test (independent groups, unequal variances assumed).
4. t = (78 − 73) / √(64/20 + 100/25) = 5 / √(3.2 + 4.0) = 5 / √7.2 = 5 / 2.683 = 1.864. Welch–Satterthwaite df ≈ 43.
5. Two-tailed p-value for t = 1.864 with df ≈ 43 is approximately 0.069.
6. p-value (0.069) > alpha (0.05) — fail to reject H0.
7. Conclusion: No significant difference between methods at alpha = 0.05. Effect size (Cohen's d) ≈ 0.55, a moderate effect — suggesting the test was underpowered. A power analysis would indicate that detecting an effect of this size at this alpha with 80% power requires roughly n = 52 per group.

Again, the borderline result raises power and effect-size questions. The methods may genuinely differ, but the sample is too small to detect the difference reliably. Reporting the effect size makes this clear; reporting only "p > 0.05" hides it.
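scipy provides a summary-statistics entry point for exactly this situation; a sketch (pass equal_var=False to get Welch's form):

```python
# Welch's two-sample t-test from summary statistics.
from scipy import stats

result = stats.ttest_ind_from_stats(
    mean1=78, std1=8, nobs1=20,   # Method A
    mean2=73, std2=10, nobs2=25,  # Method B
    equal_var=False,              # Welch's form (unequal variances)
)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
# t = 1.864, p = 0.069 -> fail to reject H0 at alpha = 0.05
```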
Key Points
- •Two-sample t-test compares means from two unrelated groups
- •Use Welch's correction by default — robust to unequal variances, reduces to Student's when variances equal
- •Effect size (Cohen's d) interprets practical magnitude regardless of p-value
- •Underpowered tests (small n, moderate effect) frequently produce borderline non-significant results
- •Power analysis BEFORE data collection is the cure for underpowered tests
5. Test Family 3: Paired t-test, ANOVA, and Chi-Square (Brief Worked Examples)
Three more high-frequency tests with worked one-pass examples.

Paired t-test. Use when the same subjects are measured twice (before/after, matched pairs). Compute the difference for each pair, then test whether the mean difference equals zero. Example: 15 subjects measured before and after an intervention; mean difference = 4.2, SD of differences = 6, n = 15. t = 4.2 / (6/√15) = 4.2 / 1.549 = 2.71, df = 14, two-tailed p = 0.017. Reject H0 at alpha = 0.05 — the intervention had a significant effect.

One-way ANOVA. Use to compare means across 3 or more groups. The test statistic is F = MS_between / MS_within. Example: three teaching methods, total n = 60 (20 per group), MS_between = 50, MS_within = 12.5. F = 50/12.5 = 4.0, df = (2, 57), p ≈ 0.024. Reject H0 — at least one method differs. Follow up with Tukey's HSD to find which pairs differ.

Chi-square test of independence. Use for two categorical variables to test whether they are independent. Example: 200 voters surveyed by party (D/R/I) and policy preference (yes/no). Compute the expected counts under independence, then χ² = Σ (O − E)² / E. If χ² = 12.5 with df = 2 (= (rows − 1)(cols − 1)), p ≈ 0.002 — strong evidence of an association between party and preference.

Each of these has its own assumption checklist (paired t-test: normality of differences; ANOVA: normality + equal variances per group; chi-square: expected counts ≥ 5 per cell). Violations push you to the non-parametric alternatives covered next.
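All three p-values can be checked directly against scipy's reference distributions; a short sketch:

```python
# Verifying the three p-values from scipy's reference distributions.
import math
from scipy import stats

# Paired t-test: t = 4.2 / (6 / sqrt(15)), df = 14
t_paired = 4.2 / (6 / math.sqrt(15))
p_paired = 2 * stats.t.sf(t_paired, df=14)   # 0.017

# One-way ANOVA: F = 4.0 with df = (2, 57)
p_anova = stats.f.sf(4.0, dfn=2, dfd=57)     # 0.024

# Chi-square test of independence: chi2 = 12.5, df = 2
p_chi2 = stats.chi2.sf(12.5, df=2)           # 0.002

print(f"paired t: t = {t_paired:.2f}, p = {p_paired:.3f}")
print(f"ANOVA:    p = {p_anova:.3f}")
print(f"chi-sq:   p = {p_chi2:.4f}")
```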
Key Points
- •Paired t-test: same subjects measured twice; analyze differences not raw values
- •One-way ANOVA: compares 3+ group means; F-statistic; follow-up with post-hoc test
- •Chi-square test of independence: 2 categorical variables; expected counts must be ≥ 5
- •Each test has its own assumption checklist; check before computing the statistic
- •When assumptions violated, switch to non-parametric alternative
6. Decision Tree: Which Test to Use
A practical decision tree based on the question, data type, and number of groups.

Is the response variable continuous?

YES — how many groups?
• One group, comparing to a known value → one-sample t-test (or z if σ known).
• Two groups, independent → Welch's two-sample t-test (or Mann-Whitney U if non-normal).
• Two groups, paired → paired t-test (or Wilcoxon signed-rank if non-normal).
• Three or more groups, independent → one-way ANOVA (or Kruskal-Wallis if non-normal).
• Two factors → two-way ANOVA.
• Repeated measures across 3+ time points → repeated-measures ANOVA or mixed model.

NO (categorical response) — what kind of question?
• Test of independence between two categorical variables → chi-square test.
• Goodness-of-fit to a theoretical distribution → chi-square goodness-of-fit.
• Two proportions → two-proportion z-test.

Bivariate continuous, testing a relationship → Pearson correlation (parametric) or Spearman/Kendall (non-parametric); regression for prediction.

The parametric vs non-parametric branch turns on assumption checks: Shapiro-Wilk for normality, Levene's test for equal variances, and sample size considerations (the CLT helps when n > 30 per group). Non-parametric tests have lower power but are robust to violations.
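The continuous-outcome branch of the tree is mechanical enough to write down as a function. A sketch, with invented parameter names (this is an illustration, not a standard API):

```python
# A sketch of the continuous-outcome branch of the decision tree.
# Parameter names are illustrative, not a standard API.
def choose_test(n_groups: int, paired: bool = False,
                normal: bool = True, sigma_known: bool = False) -> str:
    """Return the recommended test for a continuous response variable."""
    if n_groups == 1:
        return "one-sample z-test" if sigma_known else "one-sample t-test"
    if n_groups == 2:
        if paired:
            return "paired t-test" if normal else "Wilcoxon signed-rank"
        return "Welch's two-sample t-test" if normal else "Mann-Whitney U"
    # three or more independent groups
    return "one-way ANOVA" if normal else "Kruskal-Wallis"

print(choose_test(2, paired=False, normal=False))  # Mann-Whitney U
```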
Key Points
- •Continuous outcome + 1 group → one-sample t (or z)
- •Continuous outcome + 2 independent groups → Welch's t (or Mann-Whitney)
- •Continuous outcome + 3+ groups → one-way ANOVA (or Kruskal-Wallis)
- •Two categorical variables → chi-square test of independence
- •Always check assumptions before applying parametric tests — CLT helps if n > 30/group
7. P-Values, Type I and Type II Errors, and Power
A p-value is the probability, ASSUMING the null hypothesis is true, of observing a test statistic at least as extreme as the one observed. It is NOT the probability that the null hypothesis is true. This distinction matters because the most common misinterpretation ("p = 0.04 means there is only a 4% chance the null is true") is wrong.

Four outcomes are possible in any hypothesis test:

| Reality | Decision: Reject H0 | Decision: Fail to Reject H0 |
|---|---|---|
| H0 is true | Type I error (alpha) | Correct (1 − alpha) |
| H0 is false | Correct (power, 1 − beta) | Type II error (beta) |

Type I error (α) is wrongly rejecting a true null — a false positive. Convention sets alpha at 0.05. Type II error (β) is wrongly failing to reject a false null — a false negative. Convention targets beta at 0.20 (power = 0.80). Power (1 − β) is the probability of correctly rejecting a false null. It depends on alpha, effect size, and sample size. Increasing sample size is the only lever that increases power without inflating type I error.

A published power lookup table for two-sample t-tests at alpha = 0.05, power = 0.80:

| Effect size (Cohen's d) | Required n per group |
|---|---:|
| 0.20 (small) | 393 |
| 0.30 | 175 |
| 0.40 | 99 |
| 0.50 (medium) | 64 |
| 0.60 | 45 |
| 0.70 | 33 |
| 0.80 (large) | 26 |
| 1.00 | 17 |

This table is the cure for underpowered studies. Compute the smallest effect size of practical interest, look up the required n, and collect that much data — anything less wastes effort and produces ambiguous results.
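The table's entries can be reproduced with statsmodels' power solver; a sketch, assuming statsmodels is installed:

```python
# Reproducing the power table with statsmodels' power solver.
import math
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
for d in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0):
    n = solver.solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"d = {d:.1f}: n per group = {math.ceil(n)}")
# d = 0.5 -> 64 per group, matching the table
# (the other entries agree to within rounding)
```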
Key Points
- •A p-value is the probability of the data (or more extreme) under H0 — not the probability H0 is true
- •Type I error (α): false positive, conventionally 0.05
- •Type II error (β): false negative, conventionally 0.20 (power = 0.80)
- •Power increases with sample size, effect size, and alpha — only sample size is fully controllable
- •Power lookup tables give required n for any target effect size and power level
8. How StatsIQ Helps With Hypothesis Testing
Hypothesis testing problems span every introductory statistics course, the AP Statistics exam, and the first month of every applied research program. The decision tree (which test) and the assumption checks (which form) are exactly where most students get stuck. Snap a photo of any hypothesis testing problem and StatsIQ identifies the appropriate test, checks assumptions if data is provided, computes the test statistic and p-value, and produces the final reject / fail-to-reject conclusion in plain language. For power and sample-size calculations, StatsIQ produces the required n given alpha, power, and effect size.
Key Points
- •Identifies the correct test from problem description (decision tree automated)
- •Checks assumptions when data is available (normality, equal variances, expected counts)
- •Computes test statistic, p-value, and critical value
- •Produces conclusion in plain language tied to the research question
- •Useful for AP Statistics, intro stats, methods courses, and applied research design
9. Common Mistakes to Avoid
Six errors recur. First, conflating "fail to reject H0" with "accept H0" — failing to reject just means the data does not provide enough evidence against H0, not that H0 is true. Second, p-hacking: running multiple tests and reporting only the significant ones, or adjusting alpha after seeing the data. The fix is pre-registration of analysis plans. Third, ignoring effect size: a statistically significant result with tiny effect size is not practically meaningful, and a non-significant result with large effect size signals an underpowered study. Fourth, using a one-tailed test post hoc to make a non-significant two-tailed test significant. The directional choice must be made before seeing the data and justified by theory. Fifth, applying parametric tests without checking assumptions: if data is severely non-normal and n is small, switch to a non-parametric test. Sixth, reporting "p = 0.000" — p-values are never exactly zero. Report p < 0.001 instead.
Key Points
- •"Fail to reject" ≠ "accept" — absence of evidence is not evidence of absence
- •P-hacking inflates type I error; pre-register analyses to prevent it
- •Always report effect size alongside p-value
- •Tail direction (one vs two) must be set BEFORE seeing data
- •Check assumptions before using parametric tests; switch to non-parametric on violations
Key Takeaways
- ★Hypothesis testing follows a seven-step framework: state hypotheses, set alpha, select test, compute statistic, get p-value, decide, conclude
- ★P-value = P(data or more extreme | H0 true), NOT P(H0 true | data)
- ★Reject H0 when p-value < alpha; fail to reject otherwise
- ★Type I error (α) = false positive; Type II error (β) = false negative; Power = 1 − β
- ★Conventional alpha = 0.05; conventional power = 0.80
- ★One-sample t-test: compares sample mean to a known value (df = n − 1)
- ★Welch's two-sample t-test: independent groups, unequal variances assumed (default modern practice)
- ★Paired t-test: same subjects, two measurements; analyze differences
- ★One-way ANOVA: 3+ independent groups; F-statistic; follow with post-hoc Tukey HSD
- ★Chi-square test of independence: two categorical variables; expected counts must be ≥ 5
- ★Non-parametric alternatives (Mann-Whitney, Wilcoxon, Kruskal-Wallis) when assumptions fail
- ★Power for d = 0.5 (medium effect) at alpha = 0.05, power = 0.80 requires n ≈ 64 per group
Practice Questions
1. A sample of n = 36 has x̄ = 105, s = 12. Test H0: μ = 100 against H1: μ ≠ 100 at alpha = 0.05.
2. Which test compares means from two independent groups when variances are unequal?
3. A paired-sample design has mean difference = 3.5, SD of differences = 7, n = 20. Test H0: μ_d = 0 at alpha = 0.05.
4. A one-way ANOVA has 4 groups with n = 10 each. The F-statistic is 4.5. What are the degrees of freedom and the rough p-value?
5. A study has alpha = 0.05, power = 0.80, and is designed to detect Cohen's d = 0.5. What sample size per group does this require for a two-sample t-test?
6. A test produces p-value = 0.03 at alpha = 0.05. State the decision and the type of error that could still have been made.
7. Why would we ever fail to reject the null when the alternative might actually be true?
FAQs
Common questions about this topic
What exactly does a p-value mean?
A p-value is the probability of seeing the observed result (or something more extreme) IF the null hypothesis were true. It quantifies how surprising the data would be in a world where there is no effect. Small p-values mean the data is very unlikely under the null — making the null less plausible. P-values do NOT measure the probability that the null hypothesis is true; they only measure how compatible the data is with the null.
What is the difference between a type I and a type II error?
Type I error (α) is wrongly rejecting a true null hypothesis — a false positive. Type II error (β) is wrongly failing to reject a false null — a false negative. Alpha is conventionally set at 0.05; beta is conventionally set at 0.20 (giving power = 0.80). The two error rates trade off: reducing alpha (e.g., to 0.01) raises beta unless sample size also increases. Both can be controlled simultaneously only by collecting more data.
When should I use a one-tailed vs a two-tailed test?
Use a two-tailed test when the research question is "is there a difference" without specifying direction. Use a one-tailed test only when prior theory or a prior study justifies a directional alternative (e.g., "the new drug should not be worse than placebo"). Tail direction must be set BEFORE looking at the data — choosing one-tailed after a non-significant two-tailed result is a form of p-hacking. Many researchers default to two-tailed because it is more conservative and rarely wrong.
Should I use Student's t-test or Welch's t-test?
Use Welch's t-test by default. It does not require equal variances and reduces to Student's when variances happen to be equal. The traditional approach was to pre-test for equal variances (Levene's test) and switch between Student's and Welch's based on the result, but modern statistical practice now recommends defaulting to Welch's because the pre-test introduces its own type I error inflation. R's t.test defaults to Welch's, SPSS reports both forms side by side, and in Python you request it explicitly with scipy.stats.ttest_ind(equal_var=False).
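A minimal sketch with invented data arrays:

```python
# Welch's t-test in scipy: pass equal_var=False explicitly
# (scipy's default is Student's form). Data is illustrative.
from scipy import stats

x = [78, 82, 69, 74, 80, 77, 71, 85]
y = [72, 68, 75, 70, 66, 74, 69, 73]
t, p = stats.ttest_ind(x, y, equal_var=False)  # Welch's form
print(f"t = {t:.3f}, p = {p:.4f}")
```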
What is statistical power and why does it matter?
Power is the probability of correctly rejecting a false null hypothesis (1 − β). High power means the test is likely to detect a real effect; low power means it might miss one. Power depends on alpha, sample size, and the true effect size. The conventional target is 0.80 (80% power). Power analysis BEFORE data collection determines the required sample size to detect a meaningful effect; power analysis AFTER non-significant results explains why the study may have missed a real effect.
Do I need to report effect size alongside the p-value?
Yes. Modern reporting standards (APA, AMA, most journals) require effect size alongside p-values because statistical significance does not measure practical importance. A study with n = 10,000 can produce p < 0.001 for an effect so small it is meaningless; a study with n = 30 can produce p > 0.05 for a large effect that is missed due to underpower. Common effect sizes: Cohen's d for mean differences, r for correlations, η² (eta-squared) for ANOVA, odds ratio for binary outcomes.
When should I use a non-parametric test?
Use non-parametric tests when parametric assumptions (normality, equal variances, interval-scale data) are violated and sample size is too small for the Central Limit Theorem to rescue you. Common substitutions: Mann-Whitney U for two-sample t-test, Wilcoxon signed-rank for paired t-test, Kruskal-Wallis for one-way ANOVA, Spearman correlation for Pearson. Non-parametric tests have slightly lower power than parametric tests when assumptions are met but are robust to violations.
Can StatsIQ solve hypothesis testing problems?
Yes. Snap a photo of any hypothesis testing problem and StatsIQ identifies the correct test (one-sample t, two-sample t, paired t, ANOVA, chi-square, non-parametric), checks assumptions when data is available, computes the test statistic and p-value, and produces a complete reject / fail-to-reject conclusion. StatsIQ also handles power and sample-size calculations. This content is for educational purposes only and does not constitute statistical advice.