Two-Sample t-Test Step by Step: Hypotheses, Calculation, and Interpretation With a Worked Example
A complete step-by-step walkthrough of the independent two-sample t-test, from stating hypotheses through calculating the test statistic, finding the p-value, and writing the conclusion. Includes a fully worked numerical example that you can follow along with.
What You'll Learn
- State the null and alternative hypotheses for a two-sample t-test
- Calculate the pooled standard error, t-statistic, and degrees of freedom
- Determine the p-value and make a decision at a given significance level
- Write a conclusion in context that answers the original research question
1. The Direct Answer: What the Two-Sample t-Test Does
The independent two-sample t-test assesses whether two group means differ by more than random sampling variation would explain. Example: does a new drug lower blood pressure more than a placebo? The test compares the difference between the sample means to the variability within the samples. If the difference is large relative to that variability, the test provides evidence that the population means really differ and that the gap is not just random fluctuation.
The formula: t = (x̄₁ - x̄₂) / √(s²p/n₁ + s²p/n₂), where x̄₁ and x̄₂ are the sample means, s²p is the pooled variance, and n₁ and n₂ are the sample sizes.
The assumptions: both groups are independent (no subject appears in both), the dependent variable is continuous and approximately normally distributed in each group (or n > 30 per group), and the variances in both groups are approximately equal (check with Levene's test; if variances are unequal, use Welch's t-test instead).
Snap a photo of any t-test problem and StatsIQ identifies the test type, checks the assumptions, calculates the test statistic, determines the p-value, and writes the conclusion, step by step with every formula shown.
Key Points
- The t-test compares two group means: is the difference real or just random variation?
- t = (mean difference) / (standard error of the difference). Larger |t| = more evidence of a real difference.
- Assumptions: independent groups, continuous DV, approximately normal, equal variances
- If variances are unequal (Levene's p < 0.05), use Welch's t-test instead of the pooled version
2. Worked Example: Step by Step With Real Numbers
Research question: Does caffeine improve reaction time? Two groups: caffeine (n=10) and placebo (n=10). Reaction time is measured in milliseconds (lower = faster).
Data: Caffeine group: x̄₁ = 245 ms, s₁ = 30 ms. Placebo group: x̄₂ = 268 ms, s₂ = 35 ms.
**Step 1: State hypotheses.**
H₀: μ₁ = μ₂ (no difference in mean reaction time between groups)
H₁: μ₁ ≠ μ₂ (there is a difference); this is a two-tailed test.
Significance level: α = 0.05.
**Step 2: Calculate pooled variance.**
s²p = [(n₁-1)s₁² + (n₂-1)s₂²] / (n₁ + n₂ - 2)
s²p = [(9)(900) + (9)(1225)] / 18
s²p = [8100 + 11025] / 18 = 19125 / 18 = 1062.5
**Step 3: Calculate the t-statistic.**
t = (x̄₁ - x̄₂) / √(s²p/n₁ + s²p/n₂)
t = (245 - 268) / √(1062.5/10 + 1062.5/10)
t = -23 / √(106.25 + 106.25)
t = -23 / √212.5
t = -23 / 14.58
t = -1.578
**Step 4: Degrees of freedom and p-value.**
df = n₁ + n₂ - 2 = 10 + 10 - 2 = 18
Using a t-table or calculator with df=18 and t=-1.578 (two-tailed): p ≈ 0.132
**Step 5: Decision and conclusion.**
p = 0.132 > α = 0.05. Fail to reject H₀.
Conclusion: There is not sufficient evidence at the 0.05 significance level to conclude that caffeine significantly affects reaction time. The 23 ms difference between groups could be due to random variation.
StatsIQ shows every one of these steps when you snap a photo of a t-test problem, including the pooled variance calculation that most students find hardest.
Key Points
- Step 1: H₀ (no difference) vs H₁ (difference exists). Set α (usually 0.05).
- Step 2: Pooled variance combines both groups' variability: s²p = weighted average of s₁² and s₂²
- Step 3: t = mean difference ÷ standard error. Larger |t| = stronger evidence against H₀.
- Step 4: df = n₁+n₂-2. Look up p-value. Step 5: if p < α, reject H₀. If p ≥ α, fail to reject.
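The five steps can also be checked in code. The following is a minimal sketch, not a reference implementation: it uses only the Python standard library, the function names (`pooled_t_test`, `t_pdf`, `two_tailed_p`) are made up for this illustration, and the p-value is approximated by integrating the t density numerically with Simpson's rule instead of reading a t-table.

```python
import math


def pooled_t_test(mean1, sd1, n1, mean2, sd2, n2):
    """Return (pooled_variance, t, df) from summary statistics (Steps 2-4)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    std_err = math.sqrt(pooled_var / n1 + pooled_var / n2)
    t = (mean1 - mean2) / std_err
    return pooled_var, t, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: 1 minus the area between -|t| and +|t|.

    The area from 0 to |t| is approximated with Simpson's rule
    (steps must be even), then doubled by symmetry.
    """
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return 1 - 2 * area_zero_to_t


# Summary statistics from the caffeine example (group 1 = caffeine).
pooled_var, t_stat, df = pooled_t_test(245, 30, 10, 268, 35, 10)
p_value = two_tailed_p(t_stat, df)

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(f"pooled variance = {pooled_var:.1f}")  # 1062.5
print(f"t({df}) = {t_stat:.3f}, p = {p_value:.3f} -> {decision}")
# t(18) = -1.578, p = 0.132 -> fail to reject H0
```

If a statistics package is available, its t-distribution functions are the more natural way to get the p-value; the numerical integration here only keeps the sketch self-contained.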
3. Writing the Conclusion: What to Say and What Not to Say
The conclusion must include four elements: the decision (reject or fail to reject H₀), the significance level, the test statistic and p-value, and an interpretation in context.
Good conclusion for our example: "An independent samples t-test was conducted to compare reaction times between the caffeine group (M = 245, SD = 30) and the placebo group (M = 268, SD = 35). There was no statistically significant difference between the groups at α = .05, t(18) = -1.578, p = .132. Caffeine did not significantly improve reaction time in this sample."
What NOT to say: "We accept the null hypothesis." You never accept H₀; you either reject it or fail to reject it. Failing to reject means the evidence was not strong enough to conclude a difference, not that you have proven the groups are equal. The distinction matters because a small sample might simply lack the power to detect a real difference.
Also do not say: "The result is insignificant." Say "not statistically significant." Insignificant implies the result does not matter. Not statistically significant means the evidence did not reach the threshold, a subtle but important difference. With a larger sample, the same 23 ms difference might reach significance because the standard error would be smaller.
Effect size complements the p-value: Cohen's d = (x̄₁ - x̄₂) / sp = -23 / √1062.5 = -23 / 32.6 = -0.71. This is a medium-to-large effect. The practical difference (23 ms faster) might be meaningful even though it is not statistically significant with n=10 per group; the study may have been underpowered.
Key Points
- Report: test type, t-value, df, p-value, group means and SDs, and interpretation in context
- NEVER say "accept H₀"; say "fail to reject H₀" (insufficient evidence, not proof of no difference)
- "Not statistically significant" ≠ "no difference." It means the evidence was not strong enough at this sample size.
- Cohen's d measures effect SIZE independent of sample size: small (0.2), medium (0.5), large (0.8)
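To see why the same 23 ms difference could reach significance with more data, it helps to compute Cohen's d and t side by side. The sketch below is illustrative only: the group sizes of 20 and 40 are hypothetical (the study had n=10 per group), the means and SDs are held fixed at the example's values, and the helper names are invented for this illustration.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation sp (square root of the pooled variance)."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var)


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d: mean difference in units of the pooled SD."""
    return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled t-statistic, written as d divided by sqrt(1/n1 + 1/n2)."""
    d = cohens_d(mean1, sd1, n1, mean2, sd2, n2)
    return d / math.sqrt(1 / n1 + 1 / n2)


results = {}
# n = 10 is the article's sample size; 20 and 40 are hypothetical.
for n in (10, 20, 40):
    d = cohens_d(245, 30, n, 268, 35, n)
    t = pooled_t(245, 30, n, 268, 35, n)
    results[n] = (d, t)
    print(f"n = {n:2d} per group: d = {d:.2f}, t = {t:.2f}")
# n = 10 per group: d = -0.71, t = -1.58
# n = 20 per group: d = -0.71, t = -2.23
# n = 40 per group: d = -0.71, t = -3.16
```

The effect size stays at about -0.71 in every row, while |t| grows roughly with the square root of the sample size. That is the sense in which d is independent of sample size and the p-value is not.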
4. Common Variations and When to Use Each
Paired (dependent) t-test: when the same subjects are measured twice (before/after, or matched pairs). The formula uses the differences within each pair rather than comparing group means: t = d̄ / (sd / √n), where d̄ is the mean of the differences and sd is the standard deviation of the differences. Use when: pre-test/post-test designs, matched-pair experiments, or when each subject serves as their own control.
Welch's t-test: when the equal variance assumption is violated (Levene's test p < 0.05 or one SD is more than double the other). Welch's does not pool the variances; it calculates the standard error from each group's variance separately and adjusts the degrees of freedom downward. Some statistical software defaults to Welch's (R's t.test function, for example) because it performs well even when variances are equal. If your professor does not specify, Welch's is the safer choice.
One-sample t-test: comparing a single group mean to a known value (not another group). Example: is this class's average test score significantly different from the national average of 75? t = (x̄ - μ₀) / (s / √n). Use when: you have one sample and a hypothesized population mean.
The choice between these three depends on study design: independent groups → independent t-test (or Welch's). Same subjects measured twice → paired t-test. One group vs a known value → one-sample t-test. StatsIQ identifies which variant applies from the problem description and solves accordingly.
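As a rough illustration of how Welch's version differs from the pooled test, the sketch below applies it to the caffeine example's summary statistics. The degrees-of-freedom formula used is the standard Welch-Satterthwaite approximation, which the text above describes only as a downward adjustment; the function name `welch_t_test` is an assumption made for this example.

```python
import math


def welch_t_test(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, df) for Welch's unequal-variance t-test."""
    # Each group's contribution to the squared standard error (no pooling).
    v1 = sd1**2 / n1
    v2 = sd2**2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


t_welch, df_welch = welch_t_test(245, 30, 10, 268, 35, 10)
print(f"Welch: t = {t_welch:.3f}, df = {df_welch:.2f}")
# Welch: t = -1.578, df = 17.59
```

Because the two groups here have equal sizes, the Welch t-statistic matches the pooled value of -1.578; only the degrees of freedom change, dropping from 18 to about 17.59. With unequal group sizes and unequal SDs the two statistics would differ as well.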
Key Points
- Paired t-test: same subjects, two measurements. Uses within-pair differences.
- Welch's t-test: unequal variances. Does not pool; adjusts df downward. Safer default.
- One-sample t-test: one group vs a known population value (e.g., national average).
- Study design determines the variant: independent groups, paired/repeated, or single group vs known value.
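The paired and one-sample formulas are closely related: a paired t-test is a one-sample t-test on the within-pair differences with μ₀ = 0. The sketch below shows that relationship under stated assumptions. The scores are invented purely for illustration and do not come from any study, and the function names are hypothetical.

```python
import math
import statistics


def one_sample_t(values, mu0):
    """Return (t, df) for a one-sample t-test of the mean against mu0."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 in the denominator)
    return (mean - mu0) / (sd / math.sqrt(n)), n - 1


def paired_t(before, after):
    """Return (t, df) for a paired t-test via the within-pair differences."""
    diffs = [a - b for b, a in zip(before, after)]
    return one_sample_t(diffs, 0.0)


# Hypothetical test scores for five students, invented for this example.
before = [72, 75, 68, 80, 77]
after = [75, 79, 70, 81, 82]

t_paired, df_paired = paired_t(before, after)
print(f"paired: t({df_paired}) = {t_paired:.3f}")

# One-sample version: do the 'after' scores differ from a benchmark of 75?
t_one, df_one = one_sample_t(after, 75)
print(f"one-sample: t({df_one}) = {t_one:.3f}")
```

Note that df is the number of pairs minus one for the paired test, not n₁ + n₂ - 2, because the analysis runs on a single column of differences.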
Key Takeaways
- t = (x̄₁ - x̄₂) / SE. Larger |t| = stronger evidence against H₀.
- Pooled variance: s²p = [(n₁-1)s₁² + (n₂-1)s₂²] / (n₁+n₂-2). Weights each group's variance by its degrees of freedom (n - 1).
- df = n₁ + n₂ - 2 for pooled t-test. Welch's has adjusted (usually lower) df.
- Never "accept H₀"; only reject or fail to reject. Failing to reject ≠ proving no difference.
- Cohen's d: effect size independent of sample size. d = 0.2 small, 0.5 medium, 0.8 large.
Practice Questions
1. Group A (n=15): mean = 82, SD = 12. Group B (n=15): mean = 75, SD = 10. Test at α = 0.05 whether the means differ.
FAQs
Common questions about this topic
A z-test is used when the population standard deviation (σ) is known. A t-test is used when σ is unknown and must be estimated from the sample (s). In practice, σ is almost never known, so the t-test is used in virtually all real applications. The z-test appears in textbooks primarily as an introduction to hypothesis testing before the t-test is taught.
Yes. Snap a photo of any t-test problem and StatsIQ identifies the variant (independent, paired, one-sample, Welch's), states the hypotheses, calculates the pooled variance and t-statistic, determines the p-value, computes Cohen's d, and writes the conclusion, all step by step with every formula shown.