Advanced · 35-45 min

Introduction to Causal Inference: Why Correlation Is Not Enough and What Actually Establishes Causation

A rigorous introduction to causal inference covering why observational correlations mislead, the potential outcomes framework, confounding, DAGs, and the key methods (randomized experiments, difference-in-differences, instrumental variables, and regression discontinuity) that let researchers make causal claims from imperfect data.

What You'll Learn

  • Explain why correlation between X and Y does not establish that X causes Y, using specific mechanisms (confounding, reverse causation, collider bias)
  • Define the potential outcomes framework and the fundamental problem of causal inference
  • Use directed acyclic graphs (DAGs) to identify confounders, mediators, and colliders in a causal structure
  • Describe 4 major causal inference methods and the assumptions each requires to produce valid estimates

1. Why Correlation Fails: The Three Ways It Goes Wrong

Everyone knows "correlation doesn't imply causation." Far fewer people can explain precisely why. There are exactly three mechanisms that generate a statistical association between X and Y without X actually causing Y. Understanding all three is the foundation of causal inference.

First, confounding. A third variable Z causes both X and Y, creating a spurious association between them. Classic example: ice cream sales and drowning deaths are positively correlated. Ice cream doesn't cause drowning. Hot weather (the confounder) increases both ice cream consumption and swimming, which increases drowning risk. In the data, X and Y move together, but only because Z is pulling both strings. Roughly 70-80% of the causal inference problems you'll encounter in coursework involve confounding.

Second, reverse causation. Y actually causes X, not the other way around. Countries with more hospitals have higher death rates. Do hospitals cause death? No: countries build more hospitals because they have sicker populations. The arrow points the other direction. Cross-sectional data (one snapshot in time) often can't distinguish X→Y from Y→X. Longitudinal data (repeated measurements over time) helps, but even temporal ordering isn't definitive proof of causation.

Third, collider bias (also called Berkson's paradox or selection bias). This one is more subtle. A collider is a variable that is caused by both X and Y. If you condition on (control for, stratify by, or select based on) the collider, you create a spurious association between X and Y even when none exists. Example: among hospitalized patients, there appears to be a negative association between flu and broken legs. Not because flu protects against fractures, but because a patient had to have one or the other (or both) to be hospitalized. Hospitalization is the collider. By restricting to hospitalized patients, you've induced an artificial relationship.

Collider bias trips up even experienced researchers because controlling for more variables feels like it should always make results more accurate, but controlling for a collider makes things worse.

These three mechanisms are exhaustive. If X and Y are correlated but X doesn't cause Y, one (or more) of these three things is happening. Causal inference methods exist to rule out or account for all three.
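The hospitalization example can be checked with a minimal simulation. All probabilities below are invented for illustration: flu and broken legs are generated independently, yet restricting to hospitalized patients (conditioning on the collider) manufactures a negative correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Invented rates: flu and broken_leg are independent by construction,
# but each raises the chance of hospitalization (the collider).
flu = (rng.random(n) < 0.10).astype(float)
broken_leg = (rng.random(n) < 0.10).astype(float)
hospitalized = (flu > 0) | (broken_leg > 0) | (rng.random(n) < 0.05)

# In the full population, flu and broken_leg are uncorrelated.
corr_all = np.corrcoef(flu, broken_leg)[0, 1]

# Conditioning on the collider (keeping only hospitalized patients)
# induces a clearly negative association between them.
corr_hosp = np.corrcoef(flu[hospitalized], broken_leg[hospitalized])[0, 1]

print(f"population corr:   {corr_all:+.3f}")   # near zero
print(f"hospitalized corr: {corr_hosp:+.3f}")  # strongly negative
```

Intuitively, within the hospitalized sample, learning a patient has no broken leg raises the odds they have flu, because something had to put them in the hospital.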

Key Points

  • Three mechanisms generate non-causal associations: confounding (shared cause), reverse causation (arrow flipped), and collider bias (conditioning on a shared effect)
  • Confounding is the most common threat: a third variable Z drives both X and Y, creating a spurious correlation
  • Collider bias is counterintuitive: controlling for a variable that X and Y both cause creates a false association
  • If X and Y are correlated but X doesn't cause Y, at least one of these three mechanisms must be operating

2. The Potential Outcomes Framework: What Causation Actually Means

To reason rigorously about causation, you need a formal definition. The potential outcomes framework (also called the Rubin causal model, after Donald Rubin) provides one. The idea: for any individual, there are two potential outcomes. Y(1) is the outcome if they receive the treatment; Y(0) is the outcome if they don't. The individual causal effect is Y(1) - Y(0). For a patient considering surgery: Y(1) is their health if they get surgery, Y(0) is their health if they don't. The causal effect of surgery for that patient is the difference.

Here's the fundamental problem of causal inference: you can never observe both potential outcomes for the same individual. A patient either gets surgery or doesn't. You see Y(1) or Y(0), never both. The unobserved potential outcome is called the counterfactual: what would have happened under the alternative scenario. Causal inference is fundamentally about estimating something you cannot directly observe.

Since individual causal effects are unobservable, we focus on the Average Treatment Effect (ATE): E[Y(1) - Y(0)] across the population. In a randomized experiment, random assignment ensures that the treatment and control groups are comparable on average, so the difference in group means estimates the ATE. This works because randomization makes treatment assignment independent of the potential outcomes. The treated group's average Y(1) is a valid estimate of what the whole population's Y(1) would be, and similarly for control.

In observational data, treatment isn't randomly assigned. People who choose surgery may be healthier (or sicker) than those who don't. The treatment and control groups differ in ways that affect the outcome, so the simple difference in means conflates the treatment effect with pre-existing differences. Every observational causal inference method is trying to recreate the conditions of a randomized experiment, making treated and untreated groups comparable, using assumptions and statistical adjustments instead of actual randomization.

A key assumption for most methods is the Stable Unit Treatment Value Assumption (SUTVA). This requires that one person's treatment doesn't affect another person's outcome (no interference) and that there's only one version of each treatment level. Vaccination studies violate SUTVA because your vaccination affects your neighbor's health through herd immunity. When SUTVA fails, the potential outcomes framework needs extension.
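Because a simulation can generate both potential outcomes for every unit, it can show directly why randomization works and self-selection doesn't. The data-generating process below is invented: a latent "health" variable raises the untreated outcome and, in the observational regime, the chance of choosing treatment, so the naive comparison overstates the true ATE of 2.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Invented potential outcomes: healthier people have better baseline
# outcomes, and the treatment adds a constant 2.0 (true ATE = 2).
health = rng.normal(0, 1, n)
y0 = health + rng.normal(0, 1, n)   # outcome without treatment
y1 = y0 + 2.0                       # outcome with treatment

true_ate = np.mean(y1 - y0)         # observable only inside a simulation

# Observational regime: healthier people self-select into treatment,
# so treated and untreated groups differ before treatment.
t_obs = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * health))
naive = y1[t_obs].mean() - y0[~t_obs].mean()   # confounded estimate

# Randomized regime: a coin flip makes the groups comparable on average.
t_rct = rng.random(n) < 0.5
rct = y1[t_rct].mean() - y0[~t_rct].mean()     # unbiased estimate

print(f"true ATE {true_ate:.2f}, naive {naive:.2f}, randomized {rct:.2f}")
```

Only the simulator ever sees both y0 and y1 for the same unit; a real analyst sees one column per person, which is exactly the fundamental problem described above.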

Key Points

  • The individual causal effect is Y(1) - Y(0), but you can never observe both potential outcomes for the same person
  • The fundamental problem of causal inference: counterfactuals are inherently unobservable
  • Randomized experiments solve this by making treatment groups comparable on average through random assignment
  • Every observational causal method tries to approximate randomization using assumptions and statistical adjustments

3. DAGs and Confounding: Seeing the Causal Structure

Directed Acyclic Graphs (DAGs) are the visual language of causal inference. A DAG is a diagram where nodes represent variables and arrows represent direct causal effects. "Directed" means the arrows point in one direction (from cause to effect). "Acyclic" means there are no loops: you can't follow the arrows in a circle back to where you started.

DAGs aren't just pretty pictures. They encode testable assumptions about the causal structure, and they tell you exactly which variables to control for (and which to leave alone) to identify a causal effect. This matters because thoughtlessly throwing every available variable into a regression model, something researchers do constantly, can introduce bias rather than remove it.

There are three building blocks of DAGs:

  • A fork: Z → X and Z → Y. Z is a common cause (confounder). X and Y are correlated because of Z. Controlling for Z removes the confounding and isolates the causal path from X to Y (if one exists).
  • A chain: X → M → Y. M is a mediator. X causes M, which causes Y. Controlling for M blocks the causal path from X to Y. If you want the total effect of X on Y, do NOT control for M. If you want the direct effect (not through M), then control for M.
  • A collider: X → C ← Y. C is caused by both X and Y. X and Y are NOT associated unless you condition on C. If you do condition on C, you create a spurious association. Do NOT control for colliders.

The practical rule: control for confounders (forks), don't control for mediators (unless you want direct effects), and never control for colliders. DAGs make these decisions visual and systematic.

Example: you want to estimate the effect of education (X) on income (Y). Parents' socioeconomic status (Z) affects both education and income, so it's a confounder (fork): you should control for it. Job type (M) is caused by education and causes income, so it's a mediator (chain): if you want the total effect of education on income, don't control for it. If you control for job type, you only see the effect of education that operates through channels other than job placement. College selectivity (C) is caused by both education effort and family wealth (which also affects income); depending on the DAG structure, controlling for it might open a collider path.

Drawing the DAG forces you to state your assumptions explicitly. Two researchers can disagree about the DAG (is Z a confounder or a mediator?) and have a productive argument about the causal structure rather than blindly running regressions.
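The fork rule is easy to verify numerically. In this invented data-generating process, Z confounds X and Y and the true causal slope of X on Y is 1.0; a regression that omits Z overstates it, while adding Z as a control recovers it.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Invented fork: Z -> X and Z -> Y, plus a true effect X -> Y of 1.0.
z = rng.normal(0, 1, n)
x = 1.5 * z + rng.normal(0, 1, n)
y = 1.0 * x + 2.0 * z + rng.normal(0, 1, n)

# Unadjusted regression of Y on X: confounded by Z, slope biased upward.
X1 = np.column_stack([np.ones(n), x])
b_unadj = np.linalg.lstsq(X1, y, rcond=None)[0][1]

# Adjusting for the confounder (the fork) recovers the causal slope ~1.0.
X2 = np.column_stack([np.ones(n), x, z])
b_adj = np.linalg.lstsq(X2, y, rcond=None)[0][1]

print(f"unadjusted slope {b_unadj:.2f}, adjusted slope {b_adj:.2f}")
```

Rerunning the same exercise with M (a mediator) or C (a collider) in place of Z reverses the lesson: there, adding the control is what introduces the bias.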

Key Points

  • DAGs encode causal assumptions visually: nodes are variables, arrows point from cause to effect
  • Forks (confounders): control for them. Chains (mediators): don't control unless you want direct effects only. Colliders: never control for them.
  • Controlling for a collider induces bias; more controls is NOT always better
  • The "back-door criterion" identifies the minimal set of variables to control for to block all confounding paths without introducing new bias

4. Methods That Actually Establish Causation

When randomization isn't possible (which is most of the time outside clinical trials), three quasi-experimental methods let researchers make credible causal claims from observational data. Each exploits a different source of variation to approximate random assignment. First, though, the benchmark they all imitate.

Randomized Controlled Trials (RCTs) remain the gold standard. Randomly assign subjects to treatment vs. control, measure outcomes, compare. Randomization ensures that all confounders, measured and unmeasured, are balanced on average across groups. The simple difference in means is an unbiased estimate of the causal effect. But RCTs aren't always feasible (you can't randomly assign smoking, poverty, or education), ethical (you can't withhold a known effective treatment), or practical (they're expensive and time-consuming). That's where the next three methods come in.

Difference-in-Differences (DiD) compares the change in outcomes over time between a group that received treatment and a group that didn't. The key assumption is "parallel trends": absent the treatment, both groups would have followed the same trajectory. Example: a state raises the minimum wage on January 1. You compare the employment change in that state (before vs. after) to the employment change in a neighboring state that didn't raise the wage. If both states were trending similarly before the policy, the difference in their changes estimates the causal effect. DiD is powerful because it controls for all time-invariant confounders (geography, culture, demographics that don't change) and all common time trends (national economic shifts). It fails when the parallel trends assumption is violated, i.e. when the treatment state was already on a different trajectory before the policy.

Instrumental Variables (IV) use a third variable (the instrument) that affects the treatment but has no direct effect on the outcome except through the treatment. Example: distance to the nearest college affects whether someone attends college (the treatment) but arguably doesn't directly affect future earnings (the outcome) except through its effect on college attendance. If the instrument is valid, comparing outcomes across values of the instrument reveals the causal effect of the treatment. IV requires two assumptions: relevance (the instrument actually affects treatment, which is testable) and exclusion (the instrument affects the outcome only through the treatment, which is not directly testable and often debatable). Weak instruments (small effect on treatment) produce unreliable estimates with enormous standard errors.

Regression Discontinuity (RD) exploits a threshold rule. If treatment is assigned based on whether a continuous variable crosses a cutoff (students scoring above 70 get a scholarship, patients with blood pressure above 140 get medication), then individuals just above and just below the cutoff are nearly identical except for their treatment status. Comparing outcomes in this narrow band around the cutoff estimates the causal effect of treatment at the cutoff. RD is one of the most credible quasi-experimental designs because the assumption (people just above and below the threshold are comparable) is often very plausible. The limitation: it only estimates the effect at the cutoff, not for the entire population.

StatsIQ helps you build intuition for these methods with practice problems that present research scenarios and ask you to identify which causal inference method applies, state the required assumptions, and evaluate whether the assumptions are plausible.
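The DiD arithmetic for the minimum-wage example is just a 2x2 table of means. All numbers below are invented: the treated state's employment rises by 3 while the control state's rises by 5, so under parallel trends the policy's effect is the difference in those changes, -2.

```python
# Toy difference-in-differences on the minimum-wage example.
# All four numbers are invented employment-index values.
treated_before, treated_after = 100.0, 103.0   # state that raised the wage
control_before, control_after = 100.0, 105.0   # neighboring state, no change

# Each state's change over time:
change_treated = treated_after - treated_before   # +3
change_control = control_after - control_before   # +5 (the common trend)

# Under parallel trends, the control state's change estimates what would
# have happened to the treated state anyway, so the causal effect is:
did = change_treated - change_control             # 3 - 5 = -2
print(f"DiD estimate: {did:+.1f}")
```

In practice DiD is estimated by regressing the outcome on group, period, and their interaction, which reproduces exactly this double difference while allowing standard errors and covariates.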

Key Points

  • RCTs are the gold standard: randomization balances all confounders (measured and unmeasured) on average
  • Difference-in-Differences compares changes over time between treated and control groups; it assumes parallel pre-treatment trends
  • Instrumental Variables use a variable that affects treatment but not the outcome directly; it requires relevance and exclusion restrictions
  • Regression Discontinuity exploits threshold-based treatment assignment; it is highly credible but estimates effects only at the cutoff

Key Takeaways

  • The fundamental problem of causal inference: you can never observe both potential outcomes Y(1) and Y(0) for the same individual
  • Confounding, reverse causation, and collider bias are the three (and only three) mechanisms that generate non-causal correlations
  • Controlling for a collider variable creates bias; more control variables is not always better
  • Randomized experiments work because random assignment makes treatment independent of potential outcomes, not because of large sample sizes
  • Difference-in-differences requires the parallel trends assumption: without treatment, both groups would have followed the same trajectory
  • Instrumental variables estimate a Local Average Treatment Effect (LATE): the effect for "compliers" only, not the entire population
  • Regression discontinuity is often considered the most credible quasi-experimental design because the identifying assumption is testable near the cutoff

Practice Questions

1. A study finds that people who eat breakfast daily have lower BMI than those who skip breakfast. A news headline reads: "Eating breakfast helps you lose weight." Identify the causal inference problem.
This is a textbook confounding problem. People who eat breakfast may differ systematically from those who skip it: they may be more health-conscious overall, exercise more, sleep more regularly, and make better food choices throughout the day. These confounders (health consciousness, lifestyle habits) cause both breakfast eating and lower BMI. The correlation between breakfast and BMI doesn't establish that breakfast itself causes weight loss. An RCT randomly assigning people to eat or skip breakfast would be needed to isolate the causal effect.
2. A researcher studying the effect of college education on earnings controls for occupation in a regression. A colleague argues this is a mistake. Who is correct and why?
The colleague is correct. Occupation is a mediator: education affects occupation, which in turn affects earnings. Controlling for occupation blocks the causal path from education to earnings that operates through job placement, which is arguably the main mechanism. The regression would only capture the effect of education on earnings that operates through channels other than occupation (like signaling or networks), underestimating the total causal effect. If the goal is the total effect of education on earnings, occupation should not be included as a control.
3. A state implements a new reading program in January 2024. A researcher compares reading scores in December 2023 (before) vs. June 2024 (after) in the treated state only. What causal inference method is this, and what is missing?
This is a simple before-after comparison, not a valid causal design. Reading scores naturally change over a school year regardless of any program. Without a control group (an untreated state with similar pre-trends), it's impossible to separate the program's effect from normal maturation, seasonal effects, or other concurrent changes. A proper difference-in-differences design requires both the time comparison (before vs. after) AND a group comparison (treated vs. untreated), plus evidence that the parallel trends assumption holds in the pre-treatment period.


FAQs

Common questions about this topic

Can you establish causation from observational data?

Yes, but it requires strong assumptions and careful methodology. Methods like difference-in-differences, instrumental variables, and regression discontinuity can produce credible causal estimates from non-experimental data when their specific assumptions are met. The credibility of the causal claim depends entirely on the plausibility of those assumptions, which is why causal inference papers spend extensive space arguing for their validity. No statistical technique alone can prove causation; the argument is always partly statistical and partly substantive.

What is the difference between a confounder and a mediator?

A confounder is a common cause of both the treatment and the outcome: it sits outside the causal pathway and creates a spurious association. You should control for it. A mediator sits on the causal pathway between treatment and outcome: treatment causes the mediator, which causes the outcome. Controlling for a mediator blocks the causal path and underestimates the total effect. The distinction depends on your causal model (DAG), not on the data alone. The same variable can be a confounder in one research question and a mediator in another.

Does StatsIQ include causal inference practice problems?

Yes. StatsIQ includes problems that ask you to identify causal structures in research scenarios, draw and interpret DAGs, select the appropriate causal inference method for a given situation, and evaluate whether the required assumptions are plausible. These are the kinds of questions that appear on advanced statistics and econometrics exams.
