How to Read ANOVA Output: Sum of Squares, Mean Square, F-Statistic, and Post-Hoc Tests
ANOVA tables show up in every statistics course and every research paper that compares group means, but most students stare at the rows and columns without knowing what any of them actually mean. This guide walks through every piece of ANOVA output (sum of squares, degrees of freedom, mean square, the F-statistic, and the p-value) and then explains what to do after a significant result with post-hoc tests.
What You'll Learn
- Identify and interpret every row and column in a standard ANOVA output table
- Explain what sum of squares between and within groups measure and how they partition total variability
- Calculate the F-statistic from mean squares and interpret it in the context of the null hypothesis
- Apply Tukey HSD and Bonferroni post-hoc tests after a significant ANOVA to identify which groups differ
1. What ANOVA Output Actually Tells You
An ANOVA table answers one question: are the means of three or more groups different enough that the differences are unlikely to be caused by random sampling alone? The table partitions the total variability in your data into two sources (variability between groups and variability within groups) and then compares them using the F-statistic. If the between-group variability is large relative to the within-group variability, the group means are probably different. If it is small, the differences you see could easily be noise. Every row, column, and number in the table serves this comparison. Once you understand the logic, reading the output becomes mechanical rather than mysterious.
Key Points
- ANOVA partitions total variability into between-group and within-group components
- The F-statistic compares how much groups differ from each other versus how much individuals differ within groups
- A significant F-test means at least one group mean differs; it does not tell you which ones
2. Sum of Squares: Where the Variability Lives
The sum of squares (SS) quantifies variability by adding up squared deviations. ANOVA output reports three sum of squares values. SS Between (also called SS Treatment or SS Factor) measures how much the group means differ from the overall grand mean. If all group means were identical, SS Between would be zero. Large values mean the groups are spread apart. The formula is SS Between = sum of n_i times (x-bar_i minus grand mean) squared, where n_i is the size of each group and x-bar_i is each group mean. SS Within (also called SS Error or SS Residual) measures how much individual observations within each group differ from their own group mean. This is the natural variability (the noise) that exists even if the treatment has no effect. The formula sums (x_ij minus x-bar_i) squared across all observations in all groups. SS Total is simply SS Between plus SS Within. It measures the total variability in the entire dataset, ignoring group membership entirely. This additive relationship is fundamental: the total variation in your data is completely decomposed into variation explained by group membership and variation left unexplained. When you report eta-squared (SS Between divided by SS Total), you are reporting the proportion of total variability that the grouping variable explains, the ANOVA equivalent of R-squared in regression.
Key Points
- SS Between measures variability of group means around the grand mean: how different the groups are
- SS Within measures variability of individual observations around their own group mean: the noise
- SS Total = SS Between + SS Within; total variability decomposes completely into these two sources
- Eta-squared = SS Between / SS Total gives the proportion of variance explained by the grouping variable
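The decomposition above can be verified numerically. This is a minimal sketch using made-up scores for three hypothetical groups; the data values are illustrative, not from the article.

```python
import numpy as np

# Hypothetical scores for three groups (illustrative data only)
groups = [
    np.array([82.0, 75.0, 90.0, 68.0, 85.0]),
    np.array([70.0, 64.0, 72.0, 66.0, 78.0]),
    np.array([88.0, 92.0, 79.0, 95.0, 86.0]),
]

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()

# SS Between: group-size-weighted squared deviations of group means from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# SS Within: squared deviations of each observation from its own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# SS Total: squared deviations of every observation from the grand mean
ss_total = ((all_obs - grand_mean) ** 2).sum()

# The additive identity SS Total = SS Between + SS Within holds exactly
decomposition_holds = np.isclose(ss_between + ss_within, ss_total)

# Proportion of variance explained by group membership
eta_squared = ss_between / ss_total
```

Whatever data you substitute, `ss_between + ss_within` will match `ss_total` up to floating-point error, because the decomposition is an algebraic identity, not a property of any particular dataset.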
3. Degrees of Freedom, Mean Squares, and the F-Statistic
Sum of squares alone cannot be compared directly because SS Between and SS Within are based on different numbers of independent pieces of information. You need to adjust for that, which is where degrees of freedom (df) and mean squares (MS) come in. Degrees of freedom for Between = k minus 1, where k is the number of groups. If you have 4 groups, df Between = 3. Degrees of freedom for Within = N minus k, where N is the total number of observations. If you have 40 observations across 4 groups, df Within = 36. Mean Square is the sum of squares divided by its degrees of freedom: MS Between = SS Between divided by (k minus 1), and MS Within = SS Within divided by (N minus k). Mean squares are variance estimates: MS Within estimates the common within-group variance (sigma squared), while MS Between estimates the within-group variance plus any additional variability due to real group differences. The F-statistic is the ratio: F = MS Between divided by MS Within. If the null hypothesis is true (all population means are equal), both MS values estimate the same thing (sigma squared), so their ratio should be close to 1. If the group means really do differ, MS Between will be inflated by the group differences, pushing F above 1. The further F is above 1, the stronger the evidence against equal means. You compare F to the F-distribution with df1 = k minus 1 and df2 = N minus k to get the p-value. Snap a photo of the ANOVA output and StatsIQ walks through each row, connecting the numbers to the underlying logic so you understand what F = 4.37 actually means in your specific problem.
Key Points
- df Between = k - 1 (number of groups minus 1); df Within = N - k (total observations minus number of groups)
- Mean Square = SS / df; it is a variance estimate that adjusts for degrees of freedom
- F = MS Between / MS Within; under H0, F should be near 1, and large F values indicate real group differences
- The p-value comes from comparing F to the F-distribution with df1 = k-1 and df2 = N-k
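The df-to-MS-to-F chain can be sketched in a few lines. The sums of squares below are hypothetical numbers chosen for a 4-group, 40-observation design matching the text's example; `scipy.stats.f.sf` gives the upper-tail area of the F-distribution.

```python
from scipy import stats

k = 4    # number of groups
N = 40   # total observations across all groups

# Hypothetical sums of squares for illustration (not from a real study)
ss_between = 300.0
ss_within = 1440.0

df_between = k - 1   # 3
df_within = N - k    # 36

# Mean squares: sums of squares adjusted for degrees of freedom
ms_between = ss_between / df_between   # 100.0
ms_within = ss_within / df_within      # 40.0

# F is the ratio of the two variance estimates
F = ms_between / ms_within             # 2.5

# p-value: probability of an F this large or larger under H0,
# from the F-distribution with (df1, df2) = (k-1, N-k)
p_value = stats.f.sf(F, df_between, df_within)
```

Swapping in your own SS, k, and N values reproduces any one-way ANOVA table row by row.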
4. Reading a Complete ANOVA Table: Worked Example
A researcher compares test scores across three teaching methods (lecture, flipped, hybrid) with 20 students per group (N = 60). The output table is:

| Source | SS | df | MS | F | p-value |
|---------|------|----|-----|------|---------|
| Between | 840 | 2 | 420 | 5.25 | 0.008 |
| Within | 4560 | 57 | 80 | | |
| Total | 5400 | 59 | | | |

Here is how to read every number. SS Between = 840 means the group means vary around the grand mean by a total squared deviation of 840. df Between = 2 because there are 3 groups minus 1. MS Between = 840 divided by 2 = 420: the average squared deviation of group means, adjusted for the number of groups. SS Within = 4560 means individual scores within each group vary around their own group mean by a total of 4560. df Within = 57 because 60 total observations minus 3 groups leaves 57. MS Within = 4560 divided by 57 = 80, which estimates the common within-group variance. F = 420 divided by 80 = 5.25. The between-group variance estimate is 5.25 times larger than the within-group variance estimate. Under the null hypothesis, we would expect this ratio to be around 1. A ratio of 5.25 is large enough to produce p = 0.008. Since p = 0.008 is less than alpha = 0.05, we reject the null hypothesis: at least one teaching method produces a different mean score. Eta-squared = 840 divided by 5400 = 0.156, so the teaching method explains about 15.6% of the variability in test scores, a medium-to-large effect by Cohen's benchmarks. But we still don't know which methods differ from which. That requires post-hoc tests.
Key Points
- Read the table left to right: Source tells you the component, SS quantifies variability, df adjusts for information, MS = SS/df
- F is just the ratio of two mean squares; think of it as signal divided by noise
- Always calculate eta-squared (SS Between / SS Total) to report effect size alongside the p-value
- A significant F-test is the starting point, not the endpoint; post-hoc tests identify specific differences
5. Post-Hoc Tests: Finding Which Groups Differ
A significant ANOVA F-test tells you that at least one group mean differs from the others. It does not tell you which groups differ. To answer that question, you run post-hoc (after-the-fact) pairwise comparison tests. These tests compare every pair of group means while controlling for the increased risk of false positives from running multiple comparisons. Why not just run multiple t-tests? Because the more comparisons you run, the higher the chance of at least one false positive. With 3 groups, there are 3 pairwise comparisons. With 5 groups, there are 10. With 10 groups, there are 45. If each test uses alpha = 0.05, the probability of at least one false positive across all tests (the family-wise error rate) grows quickly. For 10 comparisons at alpha = 0.05, the family-wise error rate is about 40%, not 5%.

Tukey's HSD (Honestly Significant Difference) is the most common post-hoc test. It controls the family-wise error rate at your chosen alpha while comparing all pairs of means. It calculates a critical difference: if the absolute difference between two group means exceeds the Tukey critical value times the standard error, those means are significantly different. Tukey HSD is the default choice when you want to compare all pairs.

Bonferroni correction is simpler but more conservative: divide alpha by the number of comparisons. For 3 comparisons at alpha = 0.05, each comparison uses alpha = 0.05 divided by 3 = 0.0167. This controls the family-wise error rate but is more likely to miss real differences (lower power) than Tukey, especially with many groups.

Dunnett's test is specialized for comparing each treatment group to a single control group (not all pairwise comparisons). It is more powerful than Tukey or Bonferroni when you only care about treatment-versus-control differences. The choice of post-hoc test should match your research question: Tukey for all pairs, Dunnett for treatment-vs-control, Bonferroni for a small number of planned comparisons.
Key Points
- Multiple t-tests inflate the family-wise error rate; post-hoc tests correct for this
- Tukey HSD compares all pairs of means while controlling the family-wise error rate at alpha
- Bonferroni divides alpha by the number of comparisons; simple but conservative with many groups
- Dunnett is best when you only need to compare treatments to a single control group
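As one sketch of the multiple-comparison logic, the snippet below computes the family-wise error rate for 10 uncorrected tests and runs Bonferroni-corrected pairwise t-tests on made-up data for three groups. (Tukey HSD itself is available elsewhere, for example as `pairwise_tukeyhsd` in statsmodels; it is not implemented here.)

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Family-wise error rate for m independent uncorrected tests at alpha = 0.05:
# P(at least one false positive) = 1 - (1 - alpha)^m
alpha, m = 0.05, 10
fwer = 1 - (1 - alpha) ** m   # about 0.40 for 10 comparisons, as the text notes

# Bonferroni-corrected pairwise t-tests on illustrative (made-up) data
groups = {
    "lecture": np.array([72.0, 68.0, 75.0, 70.0, 66.0, 74.0]),
    "flipped": np.array([80.0, 85.0, 78.0, 82.0, 88.0, 79.0]),
    "hybrid":  np.array([76.0, 74.0, 81.0, 77.0, 73.0, 80.0]),
}
pairs = list(combinations(groups, 2))   # 3 groups -> 3 pairwise comparisons
alpha_per_test = alpha / len(pairs)     # 0.05 / 3, about 0.0167

for a, b in pairs:
    t, p = stats.ttest_ind(groups[a], groups[b])
    verdict = "significant" if p < alpha_per_test else "not significant"
    print(f"{a} vs {b}: p = {p:.4f} ({verdict})")
```

The `fwer` calculation makes the inflation concrete: each added comparison compounds the chance of a spurious finding, which is exactly what the corrections are designed to cap.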
6. Assumptions, Violations, and What to Do About Them
ANOVA assumes three things about your data, and violating them can produce misleading results.

Independence means the observations are not related to each other. Each data point comes from a different, independently selected subject. This assumption is violated by repeated-measures designs (the same person measured multiple times), which require repeated-measures ANOVA instead of one-way ANOVA.

Normality means the data within each group are approximately normally distributed. ANOVA is robust to moderate violations of normality, especially with balanced designs (equal group sizes) and sample sizes above 15-20 per group. For severely non-normal data or small samples, the Kruskal-Wallis test is a non-parametric alternative that does not assume normality.

Equal variances (homoscedasticity) means the spread of data is similar across groups. Levene's test checks this assumption: if Levene's test is significant (p less than 0.05), the variances are unequal. When variances are unequal, the standard F-test can produce inflated Type I error rates. Welch's ANOVA is the recommended alternative; it adjusts the degrees of freedom to account for unequal variances and does not require the homoscedasticity assumption.

A practical approach: run Levene's test first. If it is non-significant, use standard ANOVA. If it is significant, switch to Welch's ANOVA. For post-hoc tests after Welch's ANOVA, use the Games-Howell procedure instead of Tukey HSD, since Games-Howell does not assume equal variances.
Key Points
- Independence is the most critical assumption; it is violated in repeated-measures designs
- ANOVA is robust to moderate normality violations, especially with balanced groups of n greater than 15
- Run Levene's test for equal variances; if it fails, switch to Welch's ANOVA
- Use Games-Howell instead of Tukey HSD when variances are unequal
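The Levene-then-decide workflow can be sketched with scipy on simulated data with deliberately unequal spreads. Note the assumptions here: scipy has no built-in Welch ANOVA (it exists in packages such as pingouin), so the Alexander-Govern test, a related scipy test that also drops the equal-variance assumption, stands in for it in this sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated groups; the third has a deliberately much larger spread
g1 = rng.normal(loc=50, scale=5, size=30)
g2 = rng.normal(loc=52, scale=5, size=30)
g3 = rng.normal(loc=55, scale=15, size=30)

# Levene's test: H0 says all group variances are equal
levene_stat, p_levene = stats.levene(g1, g2, g3)

if p_levene < 0.05:
    # Variances look unequal: use a variance-robust alternative.
    # Alexander-Govern plays the role Welch's ANOVA does in the text.
    result = stats.alexandergovern(g1, g2, g3)
else:
    # Variances look similar: the standard one-way F-test is fine.
    result = stats.f_oneway(g1, g2, g3)
```

With spreads this different (standard deviation 15 versus 5), Levene's test should flag the inequality and route the comparison down the robust branch.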
Key Takeaways
- ✓ SS Total = SS Between + SS Within: total variability is completely partitioned into between-group and within-group components
- ✓ F = MS Between / MS Within. Under H0 (equal means), F is approximately 1. Large F values indicate real group differences.
- ✓ Eta-squared = SS Between / SS Total measures effect size. Cohen benchmarks: 0.01 small, 0.06 medium, 0.14 large.
- ✓ Tukey HSD is the default post-hoc test for all pairwise comparisons after a significant ANOVA
- ✓ Welch's ANOVA replaces the standard F-test when Levene's test indicates unequal group variances
- ✓ Post-hoc tests are only run after a significant omnibus F-test; never run them if F is non-significant
Practice Questions
1. An ANOVA table shows SS Between = 500, SS Within = 2000, df Between = 4, df Within = 45. Calculate MS Between, MS Within, F, and eta-squared.
2. After a significant one-way ANOVA comparing 5 groups, a student runs 10 independent t-tests at alpha = 0.05 without correction. What is the approximate family-wise error rate?
3. Levene's test returns p = 0.02. What should you do before proceeding with group comparisons?
FAQs
Common questions about this topic
What does a non-significant F-test mean?
A non-significant F-test (p greater than alpha) means you do not have sufficient evidence to conclude that any group means differ. It does not prove the means are equal; your study may have lacked power to detect real differences. Do not run post-hoc tests after a non-significant F-test.
Can I run ANOVA with unequal group sizes?
Yes, but unequal group sizes reduce robustness to violations of the equal variances assumption and can inflate Type I error rates. If group sizes are very different, check Levene's test carefully and consider Welch's ANOVA. Balanced designs (equal group sizes) are always preferable when you can control the study design.
How does StatsIQ help with ANOVA problems?
Snap a photo of an ANOVA table or problem statement and StatsIQ identifies the design, calculates each row of the table step by step, interprets the F-statistic and p-value in context, computes eta-squared for effect size, and walks through the appropriate post-hoc test if the result is significant.