Chi-Square Goodness-of-Fit Test: Step-by-Step Worked Examples
A walkthrough of the chi-square goodness-of-fit test from null hypothesis through expected counts, the chi-square statistic, degrees of freedom, and decision rule — with three worked examples (fair die, genetic ratios, Benford’s law) and the assumption checks most students skip.
What You'll Learn
- ✓Recognize when to use chi-square goodness-of-fit vs other chi-square tests.
- ✓Compute expected counts, the chi-square statistic, and degrees of freedom correctly.
- ✓Check the expected-count and independence assumptions before trusting the p-value.
1. Direct Answer: What the Chi-Square Goodness-of-Fit Test Does
The chi-square goodness-of-fit test asks whether observed category counts match a hypothesized distribution. You compute expected counts under the null, sum (observed minus expected) squared divided by expected across all categories, and compare the result to a chi-square distribution. Its degrees of freedom are k - 1, where k is the number of categories, minus one more for each parameter you estimate from the data. Reject the null if the p-value falls below alpha. The test handles questions like "is this die fair," "do these phenotype ratios match Mendelian 9:3:3:1," or "do these first digits follow Benford’s law." It does not test independence between two variables; that is a different chi-square test using a contingency table.
Key Points
- •Compares observed counts to expected counts under a hypothesized distribution.
- •Degrees of freedom = number of categories minus 1, minus parameters estimated from data.
- •Different from the chi-square test of independence (which uses a contingency table).
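The recipe above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a replacement for a statistics package. The function names chi2_sf and chi_square_gof, the n_estimated parameter, and the three-category counts are all made up for this example. The right-tail probability uses the standard incomplete-gamma recurrence and assumes an integer df.

```python
import math


def chi2_sf(x, df):
    """Right-tail probability P(X >= x) for a chi-square variable with integer df."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)              # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # closed form for df = 1
    # Step df up by 2 at a time with the incomplete-gamma recurrence.
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi_square_gof(observed, probs, n_estimated=0):
    """Return (statistic, df, p_value) for observed counts vs hypothesized probabilities."""
    if len(observed) != len(probs):
        raise ValueError("need one hypothesized probability per category")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("hypothesized probabilities must sum to 1")
    n = sum(observed)
    expected = [n * p for p in probs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1 - n_estimated      # categories - 1 - estimated parameters
    if df < 1:
        raise ValueError("not enough categories for the parameters estimated")
    return stat, df, chi2_sf(stat, df)


# Hypothetical counts for three equally likely categories.
stat, df, p = chi_square_gof([18, 22, 20], [1 / 3, 1 / 3, 1 / 3])
print(f"chi-square = {stat:.3f}, df = {df}, p = {p:.3f}")
```

Passing n_estimated=1 (for example, after estimating a Poisson lambda from the same data) lowers df by one, which is the adjustment described in the key points.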
2. Worked Example 1: Is the Die Fair?
You roll a six-sided die 120 times and record: 1→14, 2→22, 3→18, 4→24, 5→20, 6→22. Null hypothesis: each face has probability 1/6, so expected count per face is 120/6 = 20. Compute (14-20)²/20 + (22-20)²/20 + (18-20)²/20 + (24-20)²/20 + (20-20)²/20 + (22-20)²/20 = 1.8 + 0.2 + 0.2 + 0.8 + 0 + 0.2 = 3.2. Degrees of freedom = 6 - 1 = 5 (we did not estimate any parameters — the 1/6 probability comes from the hypothesis, not the data). Critical value for chi-square with df=5 at alpha=0.05 is 11.07. Our statistic 3.2 is far below, p ≈ 0.67. Fail to reject. The die behaves consistently with fairness.
Key Points
- •Expected count = total observations × hypothesized probability for each category.
- •No parameters estimated from data → df = categories - 1.
- •A large p-value here just means "no evidence against fairness," not "the die is proven fair."
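The die arithmetic can be checked with a short script. This is only a sketch of the hand calculation: the variable names are illustrative, and the critical value 11.07 is copied from the text rather than computed.

```python
observed = {1: 14, 2: 22, 3: 18, 4: 24, 5: 20, 6: 22}   # counts from the example
n = sum(observed.values())
expected = n / len(observed)            # 120 / 6 = 20 per face under the null

# Per-face contribution (O - E)^2 / E, then the total.
contributions = {face: (o - expected) ** 2 / expected for face, o in observed.items()}
statistic = sum(contributions.values())
df = len(observed) - 1                  # no parameters estimated from the data

CRITICAL_05_DF5 = 11.07                 # table value quoted in the text
for face, c in contributions.items():
    print(f"face {face}: contribution {c:.1f}")
print(f"chi-square = {statistic:.1f}, df = {df}")
print("reject H0" if statistic > CRITICAL_05_DF5 else "fail to reject H0")
```

Printing the per-face contributions makes it easy to see that face 1 accounts for more than half of the total statistic.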
3. Worked Example 2: Mendelian 9:3:3:1 Phenotype Ratios
A dihybrid cross predicts 9:3:3:1 phenotypes (round-yellow, round-green, wrinkled-yellow, wrinkled-green). You observe 556 plants: 315, 108, 101, 32. Expected proportions are 9/16, 3/16, 3/16, 1/16, giving expected counts of 312.75, 104.25, 104.25, 34.75. Chi-square = (315-312.75)²/312.75 + (108-104.25)²/104.25 + (101-104.25)²/104.25 + (32-34.75)²/34.75 = 0.016 + 0.135 + 0.101 + 0.218 = 0.470. Degrees of freedom = 4 - 1 = 3. Critical value at alpha=0.05 is 7.81. We fail to reject — the data fits Mendelian inheritance well, which historically is exactly the conclusion Mendel’s peas supported. Statistician R.A. Fisher famously argued in 1936 that Mendel’s numbers were "too good," with chi-square values implausibly low, but that is a separate analysis.
Key Points
- •Expected counts can be fractional — do not round them before computing.
- •For genetic ratios the null specifies the proportions directly.
- •A very small chi-square is statistically interesting in its own right (Fisher’s "too good to be true" critique).
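To see both points at once (unrounded expected counts and the "too good" question), here is a small sketch for the pea data. It keeps expected counts as exact fractions and uses the closed-form right tail of the chi-square distribution for df = 3. The helper name chi2_sf_df3 is made up for this example, and the left-tail probability is shown only to illustrate the idea behind Fisher’s critique, not to reproduce his analysis.

```python
import math
from fractions import Fraction

observed = [315, 108, 101, 32]                  # counts from the example
ratio = [9, 3, 3, 1]                            # hypothesized 9:3:3:1
n = sum(observed)

# Exact expected counts as fractions, so nothing is rounded early.
expected = [Fraction(n * r, sum(ratio)) for r in ratio]
statistic = float(sum((o - e) ** 2 / e for o, e in zip(observed, expected)))


def chi2_sf_df3(x):
    """Right-tail probability for chi-square with df = 3 (closed form)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_right = chi2_sf_df3(statistic)    # usual goodness-of-fit p-value
p_left = 1 - p_right                # chance of a fit at least this close under H0

print([float(e) for e in expected])
print(f"chi-square = {statistic:.3f}")
print(f"P(X >= stat) = {p_right:.3f}, P(X <= stat) = {p_left:.3f}")
```

A small left-tail probability means a fit this close would be unusual even if the null were exactly true. One experiment with a close fit is not suspicious on its own; the "too good" argument rests on many experiments fitting closely at once.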
4. Worked Example 3: Benford’s Law on First Digits
Benford’s law predicts that in many real-world datasets the first digit d occurs with probability log10(1 + 1/d). Probabilities: 1→30.1%, 2→17.6%, 3→12.5%, 4→9.7%, 5→7.9%, 6→6.7%, 7→5.8%, 8→5.1%, 9→4.6%. You audit 1,000 invoice totals and observe leading-digit counts: 297, 175, 118, 102, 80, 70, 60, 50, 48. Expected counts (rounded percentage × 1000, shown as whole numbers to keep the hand arithmetic readable): 301, 176, 125, 97, 79, 67, 58, 51, 46. Chi-square = sum of (O-E)²/E = 0.053 + 0.006 + 0.392 + 0.258 + 0.013 + 0.134 + 0.069 + 0.020 + 0.087 = 1.03. Degrees of freedom = 9 - 1 = 8. Critical value at alpha=0.05 is 15.51. Fail to reject: the leading digits are consistent with Benford’s law, as genuine accounting data usually are. Forensic accountants use exactly this test to flag fabricated expense reports, where the digits tend to deviate noticeably from Benford’s pattern.
Key Points
- •Benford’s law applies to "natural" data spanning multiple orders of magnitude.
- •Used in forensic accounting and fraud detection.
- •Failing the test does not prove fraud — investigate further before drawing conclusions.
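The Benford calculation is easy to script. The sketch below computes the probabilities from log10(1 + 1/d) directly instead of using the rounded percentages, so its statistic comes out close to, but not identical to, the hand-computed 1.03. Variable names are illustrative, and the critical value is the one quoted in the text.

```python
import math

observed = [297, 175, 118, 102, 80, 70, 60, 50, 48]     # digits 1..9, from the example
n = sum(observed)

# Benford probability for leading digit d: log10(1 + 1/d).
probs = [math.log10(1 + 1 / d) for d in range(1, 10)]
expected = [n * p for p in probs]                       # kept unrounded

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                                  # nothing estimated from the data

CRITICAL_05_DF8 = 15.51                                 # table value quoted in the text
for d, (o, e) in enumerate(zip(observed, expected), start=1):
    print(f"digit {d}: observed {o:4d}, expected {e:7.2f}")
print(f"chi-square = {statistic:.2f}, df = {df}")
print("reject H0" if statistic > CRITICAL_05_DF8 else "fail to reject H0")
```

The decision is the same either way here, but with a statistic near the critical value the rounding of expected counts could matter.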
5. The Assumption Most Students Skip: Expected Counts ≥ 5
The chi-square distribution is a continuous approximation to the discrete sampling distribution of the test statistic, and that approximation breaks down when expected counts get small. The standard rule: every expected count should be at least 5. A common relaxation (often credited to Cochran) allows up to 20% of expected counts below 5 as long as none is below 1. If your data violates this, combine sparse categories or switch to an exact multinomial test (Fisher’s exact test is the analogous fix for contingency tables, not for goodness-of-fit). Here is the uncomfortable truth most textbooks gloss over: when expected counts are around 1-2, a few stray observations in a sparse category can produce a large chi-square value that looks "significant" but is an artifact of the approximation, not real signal. Always check your expected counts before reporting a p-value.
Key Points
- •Expected counts ≥ 5 in every cell (or no more than 20% below 5, none below 1).
- •When expected counts are too small, combine categories or use an exact multinomial test.
- •Small expected counts can inflate the chi-square statistic and produce false positives.
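The expected-count rule is mechanical enough to automate. Here is a minimal sketch: the function name check_expected_counts and the wording of its messages are made up for this example, and the sparse table is hypothetical.

```python
def check_expected_counts(expected):
    """Classify expected counts against the two rules quoted above."""
    below_5 = sum(1 for e in expected if e < 5)
    below_1 = sum(1 for e in expected if e < 1)
    share_below_5 = below_5 / len(expected)
    if below_5 == 0:
        return "ok: every expected count is at least 5"
    if below_1 == 0 and share_below_5 <= 0.20:
        return "borderline: relaxed rule met (at most 20% below 5, none below 1)"
    return "violated: combine categories or use an exact method"


print(check_expected_counts([312.75, 104.25, 104.25, 34.75]))   # Mendel example
# Hypothetical sparse table: 2 of 6 expected counts fall below 5.
print(check_expected_counts([40.0, 30.0, 20.0, 6.0, 3.0, 1.0]))
```

Note that the check runs on expected counts, not observed counts. An observed count of zero is fine as long as the expected count for that category is large enough.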
6. Degrees of Freedom: The Subtle Adjustment Many Students Miss
Degrees of freedom for goodness-of-fit are categories minus 1, minus one more for every parameter you estimated from the data. If you test "is this normal" by estimating the mean and standard deviation from the sample and binning into k categories, df = k - 1 - 2 = k - 3. Forget this adjustment and you compare your statistic against a chi-square distribution with too many degrees of freedom, so the critical value is too large and the test becomes too conservative: you fail to reject when you should. This is one of the most common errors in applied chi-square testing and a frequent free-response trap on AP Statistics. Mendelian ratios? No parameters estimated. Fair die? No parameters estimated. Fit-a-Poisson-to-counts? You estimated lambda, so subtract one more from df.
Key Points
- •df = categories - 1 - (parameters estimated from the data).
- •Testing "fit to Poisson" by estimating lambda: df = categories - 2.
- •Forgetting the parameter adjustment makes the test too conservative.
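A Poisson fit makes the adjustment concrete. The sketch below uses invented data (100 intervals with 0 to 4 events each), estimates lambda from the sample mean, pools the upper tail into a "3 or more" bin, and reports df with and without the adjustment. The data, variable names, and binning are all assumptions chosen for illustration.

```python
import math

# Hypothetical data: how many of 100 intervals had 0, 1, 2, 3, 4 events.
freq = {0: 35, 1: 40, 2: 17, 3: 6, 4: 2}
n = sum(freq.values())

# Lambda is estimated from the sample mean, which costs one degree of freedom.
lam = sum(value * count for value, count in freq.items()) / n


def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Bins 0, 1, 2 and "3 or more"; pooling the tail keeps expected counts above 5.
observed = [freq[0], freq[1], freq[2], freq[3] + freq[4]]
probs = [poisson_pmf(k, lam) for k in range(3)]
probs.append(1 - sum(probs))                    # P(X >= 3)
expected = [n * p for p in probs]

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df_wrong = len(observed) - 1                    # forgets the estimated lambda
df_right = len(observed) - 1 - 1                # categories - 1 - parameters

print(f"lambda-hat = {lam:.2f}, chi-square = {statistic:.3f}")
print(f"df without adjustment = {df_wrong}, df with adjustment = {df_right}")
```

The statistic is the same in both cases; only the reference distribution changes. Using the larger df raises the critical value, which is why forgetting the adjustment makes the test too conservative.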
7. Using StatsIQ to Run Goodness-of-Fit on Your Data
Snap a photo of a frequency table and StatsIQ identifies the most plausible hypothesized distribution, computes expected counts, the chi-square statistic, and df with parameter adjustments handled automatically. It flags assumption violations (small expected counts) and suggests Fisher’s exact or category collapsing when appropriate. The walkthrough shows every intermediate calculation so you can verify by hand. For exam prep, the app generates practice problems with worked solutions at three difficulty levels.
Key Points
- •Snap a frequency table for an automated end-to-end goodness-of-fit run.
- •StatsIQ handles the df adjustment for estimated parameters.
- •Assumption violations are flagged before the p-value is reported.
Key Takeaways
- ★Chi-square goodness-of-fit compares observed counts to expected counts under a hypothesized distribution.
- ★Test statistic: sum of (Observed - Expected)² / Expected across all categories.
- ★Degrees of freedom: categories - 1 - parameters estimated from data.
- ★Assumption: expected counts ≥ 5 in every cell (or no more than 20% below 5, none below 1).
- ★Different from chi-square test of independence (which uses a contingency table of two variables).
Practice Questions
1. A bag of M&M’s claims color proportions of 24% blue, 14% brown, 16% green, 20% orange, 13% red, 13% yellow. You sample 500 candies and observe 110, 65, 90, 105, 70, 60. Test at alpha=0.05.
2. You bin 200 observations into 8 categories and estimate one parameter (lambda for a Poisson fit) from the data. The chi-square statistic is 14.2. Test at alpha=0.05.
3. You compute a chi-square of 0.18 with df=3 testing Mendelian ratios. Is anything wrong?
FAQs
Common questions about this topic
Goodness-of-fit tests whether one categorical variable’s distribution matches a hypothesized distribution (one row of observed counts vs hypothesized probabilities). Independence tests whether two categorical variables are related (a two-way contingency table). The arithmetic of computing the statistic is similar, but the question being asked and the degrees of freedom differ. For independence: df = (rows-1) × (columns-1). For goodness-of-fit: df = categories - 1 - estimated parameters.
The chi-square distribution is a continuous approximation to the discrete sampling distribution of the test statistic. With small expected counts, the discrete distribution has fewer possible values, the approximation breaks down, and the test statistic can take on extreme values purely by chance. Below expected counts of 5, the actual false-positive rate can drift away from your nominal alpha, often upward. The fix: combine sparse categories, switch to an exact multinomial test (or a simulated p-value), or collect more data.
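When expected counts are too small for the approximation, one option is to simulate the null distribution of the statistic directly. The sketch below is a Monte Carlo approximation to an exact test: it draws many samples from the hypothesized probabilities and counts how often the simulated statistic is at least as large as the observed one. The function names, the seed, the number of simulations, and the data are all hypothetical choices for illustration.

```python
import random


def chi_square_stat(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def monte_carlo_p_value(observed, probs, n_sims=10_000, seed=0):
    """Estimate the p-value by simulating samples under the null."""
    rng = random.Random(seed)
    n = sum(observed)
    k = len(probs)
    expected = [n * p for p in probs]
    stat_obs = chi_square_stat(observed, expected)
    at_least_as_extreme = 0
    for _ in range(n_sims):
        draws = rng.choices(range(k), weights=probs, k=n)
        counts = [0] * k
        for category in draws:
            counts[category] += 1
        # Small tolerance so float round-off does not break ties.
        if chi_square_stat(counts, expected) >= stat_obs - 1e-9:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_sims + 1)


# Hypothetical sparse data: 12 observations, expected counts 6, 3, 2, 1.
observed = [3, 3, 2, 4]
probs = [0.5, 0.25, 1 / 6, 1 / 12]
print(f"simulated p-value ≈ {monte_carlo_p_value(observed, probs):.3f}")
```

The simulated p-value depends on the seed and on n_sims, so treat the third decimal place as noise. Its advantage is that it does not rely on the chi-square approximation at all.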
You can use chi-square goodness-of-fit on continuous data, but only after binning it into categories. The cost: you lose information by discretizing a continuous variable, and the choice of bin boundaries affects your conclusion. For testing fit to a continuous distribution (normal, exponential, etc.), the Kolmogorov-Smirnov or Anderson-Darling test is typically a better choice because it does not depend on arbitrary bin choices; for normality specifically, Shapiro-Wilk is another option. Use chi-square for genuinely categorical data: dice faces, color categories, phenotypes, discrete distributions.
Large sample sizes can make tiny deviations statistically significant. Always look at the standardized residuals (Observed - Expected) / sqrt(Expected) per cell to identify which categories drive the rejection. A significant chi-square with small standardized residuals across the board may reflect a sample-size effect rather than a meaningful departure from the hypothesized distribution. Report an effect size alongside the p-value: Cohen’s w (the square root of chi-square divided by n) for goodness-of-fit, Cramér’s V for two-way tables, or a direct comparison of observed and expected proportions.
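Standardized residuals take one line each to compute. The sketch below reuses the die counts from Worked Example 1 and also reports Cohen’s w, sqrt(chi-square / n), one common effect size for goodness-of-fit. Variable names are illustrative.

```python
import math

observed = [14, 22, 18, 24, 20, 22]     # die example from the text
n = sum(observed)
expected = [n / 6] * 6

# Standardized (Pearson) residual per cell: (O - E) / sqrt(E).
residuals = [(o - e) / math.sqrt(e) for o, e in zip(observed, expected)]

# Squaring and summing the residuals recovers the chi-square statistic.
statistic = sum(r ** 2 for r in residuals)

# Cohen's w: an effect size for goodness-of-fit, sqrt(chi-square / n).
cohens_w = math.sqrt(statistic / n)

for face, r in enumerate(residuals, start=1):
    print(f"face {face}: residual {r:+.2f}")
print(f"chi-square = {statistic:.2f}, Cohen's w = {cohens_w:.3f}")
```

As a rough guide, residuals beyond about ±2 mark the cells that contribute most to a rejection. None of the die faces come close to that.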
Chi-square goodness-of-fit follows the same five-step framework as any hypothesis test: state the hypotheses, choose alpha, compute the test statistic, find the p-value (or compare to a critical value), make a decision. See the hypothesis testing complete guide for the framework, and the type I vs type II errors guide for how your decision can go wrong in either direction. The chi-square distribution itself is one of the standard distributions covered in the probability distributions complete guide.
Snap a photo of your frequency table or hypothesis (e.g., "test if this die is fair given these counts") and StatsIQ walks through the full calculation: expected counts, chi-square statistic, degrees of freedom (with adjustment for any parameters you estimated), p-value, and interpretation. It also flags assumption violations such as expected counts below 5 and suggests appropriate alternatives. This content is for educational purposes only.