Fundamentals · Intermediate · 25 min

Simpson's Paradox: Confounding Variables and Stratification (Worked Examples)

How to recognize, diagnose, and resolve Simpson's Paradox, in which an aggregated trend reverses or disappears within subgroups. Covers the classic Berkeley admissions case, kidney stone treatment data, and the stratification analysis that reveals the true relationship.

What You'll Learn

  • Recognize when Simpson's Paradox is occurring
  • Identify the lurking variable causing the reversal
  • Apply stratification to reveal the true within-group relationships
  • Decide which level of analysis (aggregated or stratified) to report

1. Direct Answer: What Simpson's Paradox Is

Simpson's Paradox occurs when a trend visible in aggregated data reverses or disappears once the data are broken down by a confounding variable. The reversal is real: both the aggregate trend and the subgroup trends are arithmetically correct. The paradox arises when the groups being compared are spread unevenly across subgroups that have different baseline rates, so aggregated statistics can mislead whenever the groups differ in their distribution of a confounding variable. Resolving it requires identifying the confounder and analyzing within strata. Three widely cited examples: UC Berkeley graduate admissions in 1973 (the aggregate suggested bias against women, but department-level admission rates for women were similar to or higher than those for men), kidney stone treatment data (Treatment A appeared worse overall but was better for both small and large stones), and COVID vaccine effectiveness analyses (raw infection rates can mislead when the age distribution differs between vaccinated and unvaccinated groups).

Key Points

  • Simpson's Paradox = aggregated trend reverses when stratified
  • Caused by confounding variable correlated with both groups and outcome
  • Both aggregate and subgroup numbers are mathematically correct
  • Resolution: stratify by the confounding variable
  • Common in causal inference, epidemiology, social science, observational studies

2. The Berkeley Admissions Example (Bickel, Hammel, O'Connell 1975)

In fall 1973, UC Berkeley admissions data appeared to show sex discrimination against women. Of male applicants, 44% were admitted; of female applicants, 35% were admitted. The 9-percentage-point gap suggested bias. When Bickel and colleagues stratified by department, the picture changed completely:

| Department | Male Apps | Male Admit % | Female Apps | Female Admit % |
|---|---|---|---|---|
| A | 825 | 62% | 108 | 82% |
| B | 560 | 63% | 25 | 68% |
| C | 325 | 37% | 593 | 34% |
| D | 417 | 33% | 375 | 35% |
| E | 191 | 28% | 393 | 24% |
| F | 373 | 6% | 341 | 7% |

In four of the six departments (A, B, D, F) women were admitted at a higher rate than men, and in the other two (C, E) the male rate was higher by only 3 to 4 percentage points. So how did the aggregate show a male advantage?

**The lurking variable**: department. Women applied disproportionately to highly competitive departments (C, D, and E, with low admission rates), while men applied disproportionately to less competitive departments (A and B, with high admission rates). The aggregate male admission rate was inflated by their concentration in easier departments; the female rate was deflated by their concentration in harder departments. Within each department, women were admitted at similar or higher rates, so the data show no systematic department-level bias against women. The aggregate appearance of bias was an artifact of differential application patterns combined with differential admission rates.

**The deeper question**: why did women apply to harder departments? Possibly societal pressures, mentorship patterns, or different interest distributions. That's a separate sociological question. The statistical question (was Berkeley's admission process biased against women?) is answered 'no, after controlling for department.'
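As a rough check on the mechanism, the sketch below rebuilds approximate admit counts from the department table and pools them. The names (`BERKELEY`, `pooled_rate`) are illustrative only. Because it uses just the six departments listed and their rounded percentages, the pooled figures are approximations and need not match the 44% and 35% quoted above.

```python
# Rows from the department table:
# department -> (male apps, male admit rate, female apps, female admit rate).
# Rates are the table's rounded percentages, so pooled results are approximate.
BERKELEY = {
    "A": (825, 0.62, 108, 0.82),
    "B": (560, 0.63, 25, 0.68),
    "C": (325, 0.37, 593, 0.34),
    "D": (417, 0.33, 375, 0.35),
    "E": (191, 0.28, 393, 0.24),
    "F": (373, 0.06, 341, 0.07),
}


def pooled_rate(apps_and_rates):
    """Pool (applicants, rate) pairs into one overall admission rate."""
    total_apps = sum(apps for apps, _ in apps_and_rates)
    total_admits = sum(apps * rate for apps, rate in apps_and_rates)
    return total_admits / total_apps


male_pooled = pooled_rate([(m, mr) for m, mr, _, _ in BERKELEY.values()])
female_pooled = pooled_rate([(f, fr) for _, _, f, fr in BERKELEY.values()])

# Share of each sex's applications that went to the two highest-rate departments.
male_share_ab = (825 + 560) / sum(m for m, _, _, _ in BERKELEY.values())
female_share_ab = (108 + 25) / sum(f for _, _, f, _ in BERKELEY.values())

# Departments where the female rate is at least the male rate.
female_ahead = [d for d, (_, mr, _, fr) in BERKELEY.items() if fr >= mr]

print(f"pooled male rate   ~ {male_pooled:.2f}")
print(f"pooled female rate ~ {female_pooled:.2f}")
print(f"share applying to A or B: men {male_share_ab:.2f}, women {female_share_ab:.2f}")
print("female rate >= male rate in:", female_ahead)
```

With these inputs the pooled male rate comes out well above the pooled female rate, even though women match or exceed men in four of the six departments. The main driver is visible in the application shares: over half of the men in this table applied to A or B, against fewer than a tenth of the women.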

Key Points

  • Aggregate: 44% men admitted vs 35% women (suggested bias)
  • Stratified: women admitted at similar/higher rates in every department
  • Lurking variable: differential application rates by department
  • Women applied more to competitive departments (lower admission rates)
  • Aggregate appearance of bias was a Simpson's Paradox artifact

3. The Kidney Stone Treatment Example (Charig et al 1986)

The study compared two treatments for kidney stones: Treatment A (open surgery) and Treatment B (percutaneous nephrolithotomy). Aggregate success rates:

- Treatment A: 273/350 = 78%
- Treatment B: 289/350 = 83%

Treatment B looks better. But stratify by stone size:

| Stone size | Treatment A | Treatment B |
|---|---|---|
| Small stones (<2 cm) | 81/87 = 93% | 234/270 = 87% |
| Large stones (≥2 cm) | 192/263 = 73% | 55/80 = 69% |

Treatment A is BETTER in both subgroups: 93% vs 87% for small stones, 73% vs 69% for large stones.

**The lurking variable**: stone size. Doctors used Treatment A more often on large stones (which have lower success rates regardless of treatment) and Treatment B more often on small stones (which have higher success rates regardless of treatment). The aggregate Treatment A rate was dragged down by its concentration in hard cases; the aggregate Treatment B rate was inflated by its concentration in easy cases.

For a patient deciding which treatment to use, the stratified data is far more informative than the aggregate. The aggregate would steer them to Treatment B, but Treatment A is better whether they have a small stone or a large one.

**Why is this Simpson's Paradox and not "just confounding"?** Simpson's Paradox is a special case of confounding: the term is reserved for cases where the direction of the relationship REVERSES under stratification, not just where the magnitude changes. In Berkeley, the direction reversed (apparent male advantage → female parity or advantage). In kidney stones, the direction reversed (B better in aggregate → A better in both strata). When stratification changes only the magnitude of the effect, it's still confounding but typically not called paradoxical.
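The kidney stone counts above are easy to verify by hand, but a short tally makes the two levels of analysis explicit. This is a minimal sketch using only the counts quoted in this section; the dictionary layout and names are assumptions made for illustration.

```python
from fractions import Fraction

# (successes, patients) per stone size and treatment, as quoted above.
KIDNEY = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}
TREATMENTS = ("A", "B")

# Success rate within each stone-size stratum, kept exact with Fraction.
by_stratum = {
    size: {t: Fraction(*cells[t]) for t in TREATMENTS}
    for size, cells in KIDNEY.items()
}

# Aggregate rate: add up successes and patients across strata, then divide.
aggregate = {
    t: Fraction(
        sum(KIDNEY[size][t][0] for size in KIDNEY),
        sum(KIDNEY[size][t][1] for size in KIDNEY),
    )
    for t in TREATMENTS
}

for size, rates in by_stratum.items():
    print(f"{size:>9}: A {float(rates['A']):.0%}  B {float(rates['B']):.0%}")
print(f"aggregate: A {float(aggregate['A']):.0%}  B {float(aggregate['B']):.0%}")
```

Using `Fraction` avoids any rounding question when comparing rates; the conversion to `float` happens only for display.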

Key Points

  • Aggregate: Treatment B 83% vs A 78% (B better)
  • Stratified: Treatment A better in both small and large stones
  • Lurking variable: stone size determines treatment choice
  • Treatment A used more on hard cases; B used more on easy cases
  • For patient decision, stratified analysis is correct

4. The Math: Why Simpson's Paradox Happens

Simpson's Paradox arises when:

1. The two groups (for example, Treatment A vs B) have DIFFERENT DISTRIBUTIONS across a confounding variable (stone size, department, age, etc.).
2. The confounding variable AFFECTS the outcome (large stones are harder to treat; competitive departments have lower admission rates).

The aggregate combines two factors: the within-stratum rates AND the proportional weight each stratum carries in each group. When the weights differ between groups, comparing aggregates mixes the effect of the treatment with the effect of the case mix.

Mathematically, the aggregate is a weighted average:

P(success | Treatment) = Σᵢ P(success | stratum i, Treatment) × P(stratum i | Treatment)

If the stratum weights P(stratum i | Treatment) differ between treatments, each aggregate is pulled toward the rate of its more heavily weighted stratum.

For Treatment A in kidney stones: aggregate = (87/350) × 93% + (263/350) × 73% = 0.249 × 0.93 + 0.751 × 0.73 = 0.232 + 0.548 = 0.780 = 78%. The 75% weight on large stones (low success) drags the aggregate down despite the strong 93% small-stone rate.

For Treatment B: aggregate = (270/350) × 87% + (80/350) × 69% = 0.771 × 0.87 + 0.229 × 0.69 = 0.671 + 0.158 = 0.829 = 83%. The 77% weight on small stones (high success) pushes the aggregate up.

The direction reversal (A worse in aggregate, A better in both strata) is mathematically possible because the weights differ enough.
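The weighted-average formula can be checked numerically. The sketch below first weights each treatment by its own case mix, which reproduces the raw aggregates, and then weights both treatments by one shared case mix (all 700 patients pooled). That second step is a form of direct standardization; it is added here as an illustration and is not part of the original worked example. Function and variable names are hypothetical.

```python
# Within-stratum success rates and stratum sizes from the kidney stone example.
RATES = {
    "A": {"small": 81 / 87, "large": 192 / 263},
    "B": {"small": 234 / 270, "large": 55 / 80},
}
COUNTS = {
    "A": {"small": 87, "large": 263},
    "B": {"small": 270, "large": 80},
}


def weights(counts):
    """Turn stratum counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {stratum: n / total for stratum, n in counts.items()}


def weighted_rate(rates, w):
    """Sum over strata of P(success | stratum, treatment) * weight(stratum)."""
    return sum(rates[stratum] * w[stratum] for stratum in rates)


# Each treatment weighted by its OWN case mix: reproduces the raw aggregate.
own_mix = {t: weighted_rate(RATES[t], weights(COUNTS[t])) for t in RATES}

# Both treatments weighted by the SAME case mix (patients pooled across treatments).
pooled_counts = {s: COUNTS["A"][s] + COUNTS["B"][s] for s in ("small", "large")}
common = weights(pooled_counts)
same_mix = {t: weighted_rate(RATES[t], common) for t in RATES}

print("own case mix: ", {t: round(r, 3) for t, r in own_mix.items()})
print("same case mix:", {t: round(r, 3) for t, r in same_mix.items()})
```

Under each treatment's own case mix B comes out ahead, matching the 78% vs 83% aggregates. Once both treatments are given the same weights, A comes out ahead (roughly 83% vs 78%), in line with the within-stratum comparison. The choice of the pooled case mix as the common weights is one reasonable option among several.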

Key Points

  • Paradox requires: confounding variable distributed differently across groups
  • AND: confounding variable affects the outcome
  • Aggregate = weighted average of within-stratum rates
  • Weights correspond to proportional representation in each stratum
  • Direction reversal possible when weights differ enough

5. How to Detect and Resolve Simpson's Paradox

**Detection workflow:**

1. **Identify potential confounders**: any variable that's plausibly related to both the predictor and the outcome. For a treatment effect, think about: patient age, severity, comorbidities, time period, geography. For an admissions analysis, think about: department, application year, undergraduate institution.
2. **Check distribution across groups**: does the confounder distribute differently in Group A vs Group B? If yes, Simpson's Paradox is possible.
3. **Check within-stratum effects**: compute the outcome rate for Group A vs Group B within each stratum. If the within-stratum effect differs from the aggregate (especially in direction), you've found a Simpson's Paradox.
4. **Decide which level to report**: typically the stratified result is more informative for individual decisions. The aggregate may still matter for population-level questions (e.g., how many total admissions, how many total successes).

**Statistical methods to handle confounding:**

- **Stratification**: report separate results within each level of the confounder.
- **Regression with covariates**: include the confounder as a control variable.
- **Matching**: match treated and control units on the confounder, then compare.
- **Inverse probability weighting**: reweight the sample so confounder distribution is balanced across groups.
- **Causal inference frameworks**: directed acyclic graphs (DAGs), do-calculus, propensity scores.

**The key intellectual move**: don't trust aggregate statistics until you've thought about plausible confounders. Especially in observational data (no random assignment), Simpson's Paradox is common and can flip your conclusion.
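Steps 2 and 3 of the detection workflow can be sketched as one small helper. This is an illustrative function, not a standard library routine: `simpson_check` and its record layout are assumptions, it handles exactly two groups, and it expects a count for every group-stratum cell.

```python
def simpson_check(records):
    """Compare two groups within strata and in aggregate.

    records: (group, stratum, successes, total) tuples, one per cell.
    Assumes exactly two groups and a complete grid of cells.
    """
    records = list(records)
    groups = sorted({g for g, _, _, _ in records})
    if len(groups) != 2:
        raise ValueError("expected exactly two groups")
    first, second = groups
    strata = sorted({s for _, s, _, _ in records})
    cell = {(g, s): (k, n) for g, s, k, n in records}

    # Step 2: how each group's cases are spread across the strata.
    totals = {g: sum(cell[g, s][1] for s in strata) for g in groups}
    mix = {g: {s: cell[g, s][1] / totals[g] for s in strata} for g in groups}

    # Step 3: rate difference (first minus second) within strata and overall.
    within = {
        s: cell[first, s][0] / cell[first, s][1]
        - cell[second, s][0] / cell[second, s][1]
        for s in strata
    }
    aggregate = {g: sum(cell[g, s][0] for s in strata) / totals[g] for g in groups}
    overall = aggregate[first] - aggregate[second]

    # Reversal: all nonzero stratum differences share one sign,
    # and the aggregate difference has the opposite sign.
    signs = {diff > 0 for diff in within.values() if diff != 0}
    reversal = len(signs) == 1 and overall != 0 and (overall > 0) not in signs
    return {"mix": mix, "within": within, "overall": overall, "reversal": reversal}


kidney = [
    ("A", "small", 81, 87), ("A", "large", 192, 263),
    ("B", "small", 234, 270), ("B", "large", 55, 80),
]
report = simpson_check(kidney)
print("case mix:", report["mix"])
print("within-stratum A minus B:", report["within"])
print("aggregate A minus B:", report["overall"])
print("reversal:", report["reversal"])
```

The helper only flags a full reversal. Step 1 (choosing which confounder to stratify on) and step 4 (choosing which level to report) remain judgment calls that no function can make from the counts alone.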

Key Points

  • Identify potential confounders before trusting aggregates
  • Check confounder distribution across comparison groups
  • Within-stratum effects vs aggregate effect can reverse direction
  • Stratified analysis usually more informative for individual decisions
  • Regression, matching, IPW, propensity scores all address confounding

6. A Modern Example: COVID Vaccine Effectiveness

Some early analyses of COVID vaccine effectiveness showed paradoxical patterns. In some country-level data, raw infection rates per 100,000 were actually HIGHER among the vaccinated population than among the unvaccinated. This was used (incorrectly) to argue against vaccine effectiveness.

The lurking variable: AGE. Vaccinated populations tended to be much older on average (because vaccines were rolled out by age priority and uptake was higher in older adults). Older adults have higher infection susceptibility, more healthcare encounters where infections get diagnosed, and worse outcomes.

Within each age stratum (e.g., 30-39, 40-49, 50-59, 60-69, 70+), vaccinated infection rates were substantially LOWER than unvaccinated rates. The aggregate looked bad because the age distribution differed between the vaccinated and unvaccinated populations. Properly age-adjusted analyses showed substantial vaccine effectiveness in every age stratum; the raw aggregate was misleading.

This is a recent example of why stratified analysis matters in public health communication. Aggregated rates without age adjustment can produce headlines that say "vaccinated have higher infection rates": technically true, but fundamentally misleading because of Simpson's Paradox.

Key Points

  • COVID data: aggregated rates appeared to show vaccinated as bad
  • Lurking variable: age (vaccinated populations older)
  • Within-age stratum: vaccines substantially effective
  • Age-adjustment essential for valid public health analysis
  • Aggregated rates without adjustment commonly misleading

7. How StatsIQ Helps With Simpson's Paradox Problems

Snap a photo of a 2x2 or stratified contingency table and StatsIQ identifies whether the data exhibit Simpson's Paradox, computes the aggregate vs stratified rates, identifies the lurking variable causing the reversal, and walks through the resolution. Especially useful for biostatistics, epidemiology, and applied stats courses where Simpson's Paradox examples are common but often confusing on first encounter.

Key Points

  • Detects when stratification reverses an aggregate trend
  • Computes aggregate and within-stratum rates side by side
  • Identifies the likely lurking confounder
  • Walks through resolution and reporting choice
  • Useful for biostats, epi, and observational data analysis

8. When the Aggregate Is the Right Answer

The textbook case treats Simpson's Paradox as a warning that aggregates mislead, but sometimes the aggregate IS the right answer. The decision depends on the causal structure.

**Use stratified analysis when:** you're trying to answer 'what would happen if we changed the treatment for someone with stone size X' or 'is the admission process unfair to female applicants in department Y?' These are within-stratum causal questions.

**Use aggregate analysis when:** you're trying to answer 'what's the overall success rate of this treatment in the populations where it's used?' or 'how many female applicants in total are admitted?' These are population-level descriptive questions.

The direction-reversal concern applies to causal questions. Descriptive aggregates aren't paradoxical; they're just answering a different question.

**The deeper lesson**: causal inference requires careful thought about what's confounding what. Stratification is one tool; regression with controls, matching, instrumental variables, and randomization are others. The right method depends on the data and the question. Don't apply Simpson's Paradox warnings reflexively; apply them when the question is causal and confounding is plausible.

This content is for educational purposes only and does not constitute statistical advice.

Key Points

  • Stratified for causal questions about within-group effects
  • Aggregate for descriptive population-level summaries
  • Direction reversal warning applies to causal claims, not descriptive
  • Simpson's Paradox isn't about which level is correct; it's about matching method to question
  • Apply causal inference tools when needed, not reflexively

Key Takeaways

  • Simpson's Paradox = aggregate trend reverses when stratified
  • Caused by confounding variable correlated with both group and outcome
  • Both aggregate and stratified numbers are mathematically correct
  • Resolution: stratify by the confounding variable
  • Classic examples: Berkeley admissions, kidney stones, COVID vaccines
  • Aggregate = weighted average of within-stratum rates
  • Common in observational data; randomization typically prevents it
  • Stratified analysis usually correct for individual decisions

Practice Questions

1. Two surgeons have these aggregate success rates: Surgeon A 90/100 = 90%, Surgeon B 80/100 = 80%. By case difficulty: easy cases, A 50/52 = 96%, B 38/40 = 95%; hard cases, A 40/48 = 83%, B 42/60 = 70%. Who is the better surgeon?
Surgeon A. A is better in both strata (96% vs 95% on easy cases, 83% vs 70% on hard cases), and the aggregate also favors A (90% vs 80%). A handled a somewhat easier case mix (52 easy cases vs 40), which widens the aggregate gap, but A would still come out ahead if the case mix were equal, so this example is consistent rather than paradoxical. To create a true Simpson's Paradox, the aggregate would need to favor B while both strata favor A.
2. What's the lurking variable in the Berkeley admissions case?
Department. Women applied disproportionately to competitive departments (lower admission rates regardless of sex), while men applied disproportionately to less competitive departments (higher admission rates). When stratified by department, women were admitted at similar or higher rates than men in every department.
3. Why does Simpson's Paradox occur more often in observational than in randomized data?
In randomized experiments, random assignment makes potential confounders distribute roughly equally across treatment groups (exactly so only in expectation, with chance imbalances shrinking as the sample grows). With balanced confounders, aggregate effects are not distorted by case mix. In observational data, confounders distribute unevenly across groups (treatment is determined by patient/subject choice or doctor selection, not random assignment), so aggregate effects can mislead. Randomization is the "gold standard" partly because it largely removes this concern.
4. How do you resolve Simpson's Paradox?
Identify the confounding variable, then stratify the analysis by levels of that variable and report within-stratum effects. Alternatively, use regression with the confounder as a control variable, matching on the confounder, propensity score methods, or inverse probability weighting. The goal is to compare like with like, within levels of the confounder, so you're not comparing apples to oranges in the aggregate.
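The claim in practice question 1 that Surgeon A would still lead under an equal case mix can be checked directly. The sketch below recomputes the rates from the counts in the question and then asks what A's aggregate would be if A had faced B's case mix; the names are illustrative.

```python
# (successes, cases) from practice question 1.
SURGEONS = {
    "A": {"easy": (50, 52), "hard": (40, 48)},
    "B": {"easy": (38, 40), "hard": (42, 60)},
}
DIFFICULTIES = ("easy", "hard")


def rate(cell):
    successes, cases = cell
    return successes / cases


# Raw aggregate success rate per surgeon.
aggregate = {
    name: sum(cells[d][0] for d in DIFFICULTIES) / sum(cells[d][1] for d in DIFFICULTIES)
    for name, cells in SURGEONS.items()
}

# A's within-difficulty rates applied to B's case counts (40 easy, 60 hard).
b_cases = {d: SURGEONS["B"][d][1] for d in DIFFICULTIES}
a_on_b_mix = sum(rate(SURGEONS["A"][d]) * b_cases[d] for d in DIFFICULTIES) / sum(
    b_cases.values()
)

print(f"aggregate: A {aggregate['A']:.1%}, B {aggregate['B']:.1%}")
print(f"A's rate on B's case mix: {a_on_b_mix:.1%}")
```

A's advantage shrinks once the case mix is equalized but does not disappear, which is why the question's data are consistent rather than paradoxical.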

FAQs

Common questions about this topic

Simpson's Paradox and confounding are closely related but distinct. Confounding refers to any case where a third variable distorts the relationship between predictor and outcome. Simpson's Paradox specifically refers to cases where stratification REVERSES the direction of the relationship, a particular and dramatic form of confounding. All Simpson's Paradox cases involve confounding, but not all confounding produces Simpson's Paradox (sometimes confounding only changes magnitude, not direction).

Simpson's Paradox can also occur with continuous variables. The continuous analog is when a positive correlation exists at the aggregate level but negative correlations exist within subgroups (or vice versa). Standard regression with appropriate controls handles this: the partial regression coefficient can be opposite in sign from the simple correlation. Always check for plausible confounders in continuous-variable analyses too.
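A tiny numerical illustration of the continuous case follows. The data points are synthetic, constructed purely to show the sign flip, and the `ols_slope` helper is a hand-written least-squares slope rather than a library call.

```python
def ols_slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Synthetic data: within each group y falls as x rises,
# but group 2 sits higher on both x and y than group 1.
group_1 = [(1, 8), (2, 7), (3, 6), (4, 5)]
group_2 = [(6, 13), (7, 12), (8, 11), (9, 10)]

slope_1 = ols_slope(group_1)
slope_2 = ols_slope(group_2)
slope_pooled = ols_slope(group_1 + group_2)

print(f"group 1 slope: {slope_1:+.2f}")
print(f"group 2 slope: {slope_2:+.2f}")
print(f"pooled slope:  {slope_pooled:+.2f}")
```

Both within-group slopes are negative while the pooled slope is positive, because the gap between the groups dominates the pooled fit. Adding group membership as a control in a regression would recover the negative within-group relationship.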

Whether to report the aggregate or the stratified result depends on the question. For descriptive questions ('what fraction of all applicants were admitted?'), the aggregate answers it. For causal questions ('does department X discriminate?'), the stratified analysis answers it. For policy questions, both can matter: the aggregate represents what's happening at the population level, and the stratified result shows where to intervene. Always think about what question you're actually answering.

StatsIQ can help with Simpson's Paradox problems. Provide the data table (or a photo of one), and StatsIQ identifies whether Simpson's Paradox is occurring, computes aggregate and stratified rates, identifies the likely lurking variable, and walks through the resolution. Especially useful for stats and epidemiology courses where Simpson's Paradox examples are common but often confusing on first encounter.
