Simpson's Paradox: Confounding Variables and Stratification (Worked Examples)
How to recognize, diagnose, and resolve Simpson's Paradox, in which an aggregated trend reverses or disappears within subgroups. Covers the classic Berkeley admissions case, kidney stone treatment data, and the stratification analysis that reveals the true relationship.
What You'll Learn
- Recognize when Simpson's Paradox is occurring
- Identify the lurking variable causing the reversal
- Apply stratification to reveal the true within-group relationships
- Decide which level of analysis (aggregated or stratified) to report
1. Direct Answer: What Simpson's Paradox Is
Simpson's Paradox occurs when a trend visible in aggregated data reverses or disappears when the data are broken down by a confounding variable. The reversal is real: both the aggregate trend and the subgroup trends are arithmetically correct. It arises when the groups being compared are spread differently across subgroups that have different baseline rates. The key insight: aggregated statistics can mislead when the groups being aggregated have different distributions of a confounding variable. Resolution requires identifying the confounding variable and analyzing within strata. Three well-known examples: UC Berkeley graduate admissions in 1973 (the aggregate suggested bias against women, but department-level admission rates showed no consistent disadvantage for women), kidney stone treatment data (Treatment A appeared worse overall but was better for both small and large stones), and various COVID vaccine effectiveness analyses (raw infection rates can mislead when age strata differ).
Key Points
- Simpson's Paradox = aggregated trend reverses when stratified
- Caused by a confounding variable correlated with both group and outcome
- Both aggregate and subgroup numbers are mathematically correct
- Resolution: stratify by the confounding variable
- Common in causal inference, epidemiology, social science, observational studies
2. The Berkeley Admissions Example (Bickel, Hammel, O'Connell 1975)
In fall 1973, UC Berkeley admissions data appeared to show sex discrimination against women. Of male applicants, 44% were admitted; of female applicants, 35% were admitted. The 9-percentage-point gap suggested bias. When Bickel and colleagues stratified by department, the picture changed completely. The figures for the six largest departments:

| Department | Male Apps | Male Admit % | Female Apps | Female Admit % |
|---|---|---|---|---|
| A | 825 | 62% | 108 | 82% |
| B | 560 | 63% | 25 | 68% |
| C | 325 | 37% | 593 | 34% |
| D | 417 | 33% | 375 | 35% |
| E | 191 | 28% | 393 | 24% |
| F | 373 | 6% | 341 | 7% |

In four of the six departments (A, B, D, F), women were admitted at a higher rate than men; in C and E the female rate was a few points lower. No department shows anything like the gap the aggregate suggests. So how did the aggregate show a male advantage?

**The lurking variable**: department. Women applied disproportionately to highly competitive departments (C, D, E, with low admission rates), while men applied disproportionately to less competitive departments (A, B, with high admission rates). The aggregate male admission rate was inflated by their concentration in easier departments; the female rate was deflated by their concentration in harder departments. Within each department, women were admitted at rates close to or above the male rate, so the table shows no systematic department-level discrimination against women. The aggregate appearance of bias was an artifact of differential application patterns combined with differential admission rates.

**The deeper question**: why did women apply to harder departments? Possibly societal pressures, mentorship patterns, or different interest distributions. That's a separate sociological question. The statistical question (was Berkeley's admission process biased against women?) is answered 'no, after controlling for department.'
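As a rough check on how department mix alone can produce an aggregate gap, the sketch below rebuilds pooled admission rates from the table. It assumes admitted counts can be approximated as applicants times the rounded percentage, and it pools only the departments shown in the table, so its result should not be expected to match the 44% and 35% quoted for all applicants. The names `departments` and `pooled_rate` are illustrative, not from the original study.

```python
# (male applicants, male admit rate, female applicants, female admit rate)
departments = {
    "A": (825, 0.62, 108, 0.82),
    "B": (560, 0.63, 25, 0.68),
    "C": (325, 0.37, 593, 0.34),
    "D": (417, 0.33, 375, 0.35),
    "E": (191, 0.28, 393, 0.24),
    "F": (373, 0.06, 341, 0.07),
}


def pooled_rate(apps_and_rates):
    """Pooled admit rate from (applicants, rate) pairs.

    Admitted counts are approximated as applicants * rounded rate.
    """
    pairs = list(apps_and_rates)
    admitted = sum(n * r for n, r in pairs)
    applicants = sum(n for n, _ in pairs)
    return admitted / applicants


male_rate = pooled_rate((m, mr) for m, mr, _, _ in departments.values())
female_rate = pooled_rate((f, fr) for _, _, f, fr in departments.values())

# Share of each sex applying to the two high-admission departments, A and B.
male_total = sum(v[0] for v in departments.values())
female_total = sum(v[2] for v in departments.values())
male_share_ab = sum(departments[d][0] for d in "AB") / male_total
female_share_ab = sum(departments[d][2] for d in "AB") / female_total

print(f"pooled male admit rate:   {male_rate:.1%}")
print(f"pooled female admit rate: {female_rate:.1%}")
print(f"men applying to A or B:   {male_share_ab:.1%}")
print(f"women applying to A or B: {female_share_ab:.1%}")
```

Under these assumptions the pooled rates come out near 45% for men and 30% for women, even though no single department shows a gap of that size. About half of the male applicants in the table applied to departments A or B, against well under a tenth of the female applicants, which is the application pattern described above.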
Key Points
- Aggregate: 44% of men admitted vs 35% of women (suggested bias)
- Stratified: women's admission rates were similar to or higher than men's in every department
- Lurking variable: differential application rates by department
- Women applied more to competitive departments (lower admission rates)
- Aggregate appearance of bias was a Simpson's Paradox artifact
3. The Kidney Stone Treatment Example (Charig et al. 1986)
The study compared two treatments for kidney stones: Treatment A (open surgery) and Treatment B (percutaneous nephrolithotomy). Aggregate success rates:

- Treatment A: 273/350 = 78%
- Treatment B: 289/350 = 83%

Treatment B looks better. But stratify by stone size:

| | Treatment A | Treatment B |
|---|---|---|
| Small stones (<2 cm) | 81/87 = 93% | 234/270 = 87% |
| Large stones (≥2 cm) | 192/263 = 73% | 55/80 = 69% |

Treatment A is BETTER in both subgroups: 93% vs 87% for small stones, 73% vs 69% for large stones.

**The lurking variable**: stone size. Doctors used Treatment A more often on large stones (which have lower success rates regardless of treatment) and Treatment B more often on small stones (which have higher success rates regardless of treatment). The aggregate Treatment A rate was dragged down by its concentration in hard cases; the aggregate Treatment B rate was inflated by its concentration in easy cases.

For a patient deciding which treatment to use, the stratified data are far more informative than the aggregate. The aggregate would steer them to Treatment B, but Treatment A has the higher success rate whether the stone is small or large.

**Why is this Simpson's Paradox and not "just confounding"?** The two terms are closely related, but Simpson's Paradox specifically refers to cases where the direction of the relationship REVERSES when stratifying, not just where the magnitude changes. In Berkeley, the direction reversed (apparent male advantage in aggregate, female parity or advantage within departments). In kidney stones, the direction reversed (B better in aggregate, A better in both strata). When stratification changes only the magnitude of an effect without reversing it, that is still confounding but is typically not called paradoxical.
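The reversal can be checked directly from the counts quoted above. The following is a minimal sketch; the dictionary layout and the helper names `rate` and `aggregate` are illustrative choices, and exact fractions are used only to avoid rounding questions.

```python
from fractions import Fraction

# (successes, patients) per treatment and stone size, as quoted above.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, patients):
    """Exact success rate as a fraction."""
    return Fraction(successes, patients)


def aggregate(treatment):
    """Success rate for a treatment with both stone sizes pooled."""
    successes = sum(s for s, _ in counts[treatment].values())
    patients = sum(n for _, n in counts[treatment].values())
    return rate(successes, patients)


# A beats B inside every stratum...
a_wins_each_stratum = all(
    rate(*counts["A"][size]) > rate(*counts["B"][size])
    for size in ("small", "large")
)
# ...yet B beats A once the strata are pooled.
b_wins_aggregate = aggregate("B") > aggregate("A")

for t in ("A", "B"):
    small = float(rate(*counts[t]["small"]))
    large = float(rate(*counts[t]["large"]))
    print(f"Treatment {t}: small {small:.0%}, large {large:.0%}, "
          f"aggregate {float(aggregate(t)):.0%}")
print("direction reverses:", a_wins_each_stratum and b_wins_aggregate)
```

Both comparisons hold at once on these counts, which is exactly the pattern the section describes: no arithmetic error, only different mixes of small and large stones behind each aggregate.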
Key Points
- Aggregate: Treatment B 83% vs A 78% (B better)
- Stratified: Treatment A better in both small and large stones
- Lurking variable: stone size influences treatment choice
- Treatment A used more on hard cases; B used more on easy cases
- For a patient's decision, the stratified analysis is the relevant one
4. The Math: Why Simpson's Paradox Happens
Simpson's Paradox arises when:

1. The two groups (Treatment A vs B) have DIFFERENT DISTRIBUTIONS across a confounding variable (stone size, department, age, etc.).
2. The confounding variable AFFECTS the outcome (large stones are harder to treat, competitive departments have lower admission rates).

The aggregate combines two factors: the within-stratum rates AND the proportional weight each group places on each stratum. When the weights differ between groups, a comparison of aggregates mixes the treatment effect with the difference in weights.

Mathematically, the aggregate is a weighted average. Writing $T$ for the treatment:

$P(\text{success} \mid T) = \sum_i P(\text{success} \mid \text{stratum } i, T) \times P(\text{stratum } i \mid T)$

If the stratum weights $P(\text{stratum } i \mid T)$ differ between treatments, each aggregate is pulled toward the rate of that treatment's better-represented stratum.

For Treatment A in kidney stones: aggregate = (87/350) × 93% + (263/350) × 73% = 0.249 × 0.93 + 0.751 × 0.73 = 0.232 + 0.548 = 0.780 = 78%. The 75% weight on large stones (low success) drags the aggregate down despite the strong 93% small-stone rate.

For Treatment B: aggregate = (270/350) × 87% + (80/350) × 69% = 0.771 × 0.87 + 0.229 × 0.69 = 0.671 + 0.158 = 0.829 = 83%. The 77% weight on small stones (high success) pushes the aggregate up.

The direction reversal (A worse in aggregate, A better in both strata) is mathematically possible because the weights differ enough.
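One way to see the role of the weights is to recompute both aggregates with the same weights. The sketch below first reproduces each treatment's observed aggregate from its own stratum weights, then applies a common set of weights to both treatments. Using the pooled stone-size mix of all 700 patients as the common weights is an assumption made here for illustration (a simple form of direct standardization); it is not taken from the original study, and the helper names are hypothetical.

```python
# (patients, success rate) per treatment and stone size, kidney stone data.
strata = {
    "A": {"small": (87, 81 / 87), "large": (263, 192 / 263)},
    "B": {"small": (270, 234 / 270), "large": (80, 55 / 80)},
}
sizes = ("small", "large")


def own_weights(treatment):
    """P(stratum | treatment): the share of a treatment's patients in each stratum."""
    total = sum(n for n, _ in strata[treatment].values())
    return {s: strata[treatment][s][0] / total for s in sizes}


def weighted_rate(treatment, weights):
    """Weighted average of a treatment's within-stratum success rates."""
    return sum(weights[s] * strata[treatment][s][1] for s in sizes)


# Each treatment's own weights reproduce its observed aggregate.
own_a = weighted_rate("A", own_weights("A"))
own_b = weighted_rate("B", own_weights("B"))

# Common weights (assumption): the stone-size mix of all patients combined.
size_totals = {s: strata["A"][s][0] + strata["B"][s][0] for s in sizes}
grand_total = sum(size_totals.values())
common = {s: size_totals[s] / grand_total for s in sizes}

std_a = weighted_rate("A", common)
std_b = weighted_rate("B", common)

print(f"own weights:    A {own_a:.1%}, B {own_b:.1%}")
print(f"common weights: A {std_a:.1%}, B {std_b:.1%}")
```

With each treatment's own weights, B comes out ahead; with common weights, A comes out ahead, matching the within-stratum comparison. Any common set of positive weights gives the same ordering here, because A has the higher rate in both strata.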
Key Points
- Paradox requires: confounding variable distributed differently across groups
- AND: confounding variable affects the outcome
- Aggregate = weighted average of within-stratum rates
- Weights correspond to proportional representation in each stratum
- Direction reversal possible when weights differ enough
5. How to Detect and Resolve Simpson's Paradox
**Detection workflow:**

1. **Identify potential confounders**: any variable that's plausibly related to both the predictor and the outcome. For a treatment effect, think about: patient age, severity, comorbidities, time period, geography. For an admissions analysis, think about: department, application year, undergraduate institution.
2. **Check distribution across groups**: does the confounder distribute differently in Group A vs Group B? If yes, Simpson's Paradox is possible.
3. **Check within-stratum effects**: compute the outcome rate for Group A vs Group B within each stratum. If the within-stratum effect differs from the aggregate (especially in direction), you've found a Simpson's Paradox.
4. **Decide which level to report**: typically the stratified result is more informative for individual decisions. The aggregate may still matter for population-level questions (e.g., how many total admissions, how many total successes).

**Statistical methods to handle confounding:**

- **Stratification**: report separate results within each level of the confounder.
- **Regression with covariates**: include the confounder as a control variable.
- **Matching**: match treated and control units on the confounder, then compare.
- **Inverse probability weighting**: reweight the sample so the confounder distribution is balanced across groups.
- **Causal inference frameworks**: directed acyclic graphs (DAGs), do-calculus, propensity scores.

**The key intellectual move**: don't trust aggregate statistics until you've thought about plausible confounders. Especially in observational data (no random assignment), Simpson's Paradox is common and can flip your conclusion.
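Steps 2 and 3 of the workflow can be sketched as a small helper that compares the aggregate direction with the direction inside each stratum. This is an illustrative sketch, not a standard library routine: the function name `check_reversal`, its input layout, and the rule of flagging only a full reversal (every stratum agrees, and the aggregate disagrees) are assumptions chosen for this example. It compares raw rates only and says nothing about statistical significance.

```python
def check_reversal(data):
    """Compare aggregate and within-stratum directions for two groups.

    data: {stratum: {group: (successes, total)}} with exactly two groups.
    Returns (aggregate_leader, stratum_leaders, reversed_flag), where a
    leader is the group with the higher rate, or None on a tie.
    """
    g1, g2 = sorted(next(iter(data.values())))

    def leader(rate_1, rate_2):
        if rate_1 == rate_2:
            return None
        return g1 if rate_1 > rate_2 else g2

    totals = {g1: [0, 0], g2: [0, 0]}
    stratum_leaders = {}
    for stratum, cells in data.items():
        for group in (g1, g2):
            successes, total = cells[group]
            totals[group][0] += successes
            totals[group][1] += total
        stratum_leaders[stratum] = leader(
            cells[g1][0] / cells[g1][1], cells[g2][0] / cells[g2][1]
        )

    aggregate_leader = leader(
        totals[g1][0] / totals[g1][1], totals[g2][0] / totals[g2][1]
    )

    # Full reversal: all strata favor one group, the aggregate favors the other.
    leaders = set(stratum_leaders.values())
    reversed_flag = (
        len(leaders) == 1
        and None not in leaders
        and aggregate_leader is not None
        and aggregate_leader not in leaders
    )
    return aggregate_leader, stratum_leaders, reversed_flag


# Kidney stone counts from the example earlier in this guide.
kidney = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}
aggregate_leader, stratum_leaders, reversed_flag = check_reversal(kidney)
print("aggregate favors:", aggregate_leader)
print("strata favor:", stratum_leaders)
print("full reversal:", reversed_flag)
```

A flag from a check like this is only a prompt to think about step 4: whether the stratifying variable is a genuine confounder for the question being asked still has to be argued from subject knowledge.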
Key Points
- Identify potential confounders before trusting aggregates
- Check confounder distribution across comparison groups
- Within-stratum effects vs aggregate effect can reverse direction
- Stratified analysis usually more informative for individual decisions
- Regression, matching, IPW, propensity scores all address confounding
6. A Modern Example: COVID Vaccine Effectiveness
Some early analyses of COVID vaccine effectiveness showed paradoxical patterns. In some country-level data, raw infection rates per 100,000 were actually HIGHER among the vaccinated population than among the unvaccinated. This was used (incorrectly) to argue against vaccine effectiveness.

The lurking variable: AGE. Vaccinated populations tended to be much older on average (because vaccines were rolled out by age priority and uptake was higher in older adults). Older adults also tend to have more healthcare encounters where infections get diagnosed, and worse outcomes when infected.

Within each age stratum (e.g., 30-39, 40-49, 50-59, 60-69, 70+), vaccinated infection rates were substantially LOWER than unvaccinated rates. The aggregate looked bad because the age distribution differed between the vaccinated and unvaccinated populations. Properly age-adjusted analyses showed substantial vaccine effectiveness in every age stratum; the raw aggregate was misleading.

This is a recent example of why stratified analysis matters in public health communication. Aggregated rates without age adjustment can produce headlines that say "vaccinated have higher infection rates": technically true, but fundamentally misleading because of Simpson's Paradox.
Key Points
- COVID data: raw aggregated rates made vaccination look ineffective
- Lurking variable: age (vaccinated populations older)
- Within-age stratum: vaccines substantially effective
- Age-adjustment essential for valid public health analysis
- Aggregated rates without adjustment commonly misleading
7. How StatsIQ Helps With Simpson's Paradox Problems
Snap a photo of a 2x2 or stratified contingency table and StatsIQ identifies whether the data exhibit Simpson's Paradox, computes the aggregate vs stratified rates, identifies the lurking variable causing the reversal, and walks through the resolution. Especially useful for biostatistics, epidemiology, and applied stats courses where Simpson's Paradox examples are common but often confusing on first encounter.
Key Points
- Detects when stratification reverses an aggregate trend
- Computes aggregate and within-stratum rates side by side
- Identifies the likely lurking confounder
- Walks through resolution and reporting choice
- Useful for biostats, epi, and observational data analysis
8. When the Aggregate Is the Right Answer
The textbook case treats Simpson's Paradox as a warning that aggregates mislead, but sometimes the aggregate IS the right answer. The decision depends on the causal structure.

**Use stratified analysis when:** you're trying to answer 'what would happen if we changed the treatment for someone with stone size X' or 'is the admission process unfair to female applicants in department Y?' These are within-stratum causal questions.

**Use aggregate analysis when:** you're trying to answer 'what's the overall success rate of this treatment in the populations where it's used?' or 'how many female applicants in total are admitted?' These are population-level descriptive questions. The direction-reversal concern applies to causal questions. Descriptive aggregates aren't paradoxical; they're just answering a different question.

**The deeper lesson**: causal inference requires careful thought about what's confounding what. Stratification is one tool; regression with controls, matching, instrumental variables, and randomization are others. The right method depends on the data and the question. Don't apply Simpson's Paradox warnings reflexively; apply them when the question is causal and confounding is plausible.

This content is for educational purposes only and does not constitute statistical advice.
Key Points
- Stratified for causal questions about within-group effects
- Aggregate for descriptive population-level summaries
- Direction reversal warning applies to causal claims, not descriptive
- Simpson's Paradox isn't about which level is correct; it's about matching the method to the question
- Apply causal inference tools when needed, not reflexively
Key Takeaways
- Simpson's Paradox = aggregate trend reverses when stratified
- Caused by confounding variable correlated with both group and outcome
- Both aggregate and stratified numbers are mathematically correct
- Resolution: stratify by the confounding variable
- Classic examples: Berkeley admissions, kidney stones, COVID vaccines
- Aggregate = weighted average of within-stratum rates
- Common in observational data; randomization typically prevents it
- Stratified analysis usually correct for individual decisions
Practice Questions
1. Two surgeons have these aggregate success rates: Surgeon A 90/100 = 90%, Surgeon B 80/100 = 80%. By case difficulty: easy cases, A 50/52 = 96%, B 38/40 = 95%; hard cases, A 40/48 = 83%, B 42/60 = 70%. Who is the better surgeon?
2. What's the lurking variable in the Berkeley admissions case?
3. Why does Simpson's Paradox occur more often in observational than in randomized data?
4. How do you resolve Simpson's Paradox?
FAQs
Common questions about this topic
**Is Simpson's Paradox the same as confounding?** Closely related but distinct. Confounding refers to any case where a third variable distorts the relationship between predictor and outcome. Simpson's Paradox specifically refers to cases where stratification REVERSES the direction of the relationship, a particular and dramatic form of confounding. All Simpson's Paradox cases involve confounding, but not all confounding produces Simpson's Paradox (sometimes confounding only changes magnitude, not direction).
**Can Simpson's Paradox occur with continuous variables?** Yes. The continuous analog is when a positive correlation exists at the aggregate level but negative correlations exist within subgroups (or vice versa). Standard regression with appropriate controls handles this: the partial regression coefficient can be opposite in sign from the simple correlation. Always check for plausible confounders in continuous-variable analyses too.
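The continuous case can be illustrated with a few made-up points. In the sketch below the data are invented purely for illustration: within each group, y falls as x rises, but the second group sits higher on both axes, so a line fitted to the pooled points slopes upward. The helper `slope` is a plain least-squares slope written out by hand rather than a call into any particular statistics library.

```python
def slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Made-up data: y decreases with x inside each group,
# but group 2 has larger x AND larger y than group 1.
group_1 = [(1, 5), (2, 4), (3, 3), (4, 2)]
group_2 = [(6, 10), (7, 9), (8, 8), (9, 7)]

slope_1 = slope(group_1)
slope_2 = slope(group_2)
pooled_slope = slope(group_1 + group_2)

print(f"group 1 slope: {slope_1:+.2f}")
print(f"group 2 slope: {slope_2:+.2f}")
print(f"pooled slope:  {pooled_slope:+.2f}")
```

Both within-group slopes are negative while the pooled slope is positive. Group membership plays the role that stone size or department plays in the categorical examples.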
**Should I report the aggregate or the stratified result?** Depends on the question. For descriptive questions ('what fraction of all applicants were admitted?'), the aggregate answers it. For causal questions ('does department X discriminate?'), the stratified analysis answers it. For policy questions, both can matter: the aggregate represents what's happening at the population level, and the stratified result shows where to intervene. Always think about what question you're actually answering.
**Can StatsIQ help with Simpson's Paradox problems?** Yes. Provide the data table (or photo of one), and StatsIQ identifies whether Simpson's Paradox is occurring, computes aggregate and stratified rates, identifies the likely lurking variable, and walks through the resolution. Especially useful for stats and epidemiology courses where Simpson's Paradox examples are common but often confusing on first encounter.