📉
Advanced · 35 min

Linear Regression Assumptions: How to Check Residuals, Homoscedasticity, and Normality

Linear regression gives you coefficients and p-values — but those are only trustworthy if the underlying assumptions hold. This guide walks through the five key assumptions, how to check each using residual plots and diagnostic tests, and what to do when assumptions are violated.

What You'll Learn

  • Identify the five key assumptions of linear regression
  • Interpret residual plots, Q-Q plots, and scale-location plots
  • Test for homoscedasticity using Breusch-Pagan and other methods
  • Detect and correct multicollinearity using VIF
  • Apply transformations and alternative approaches when assumptions fail

1. Direct Answer: Five Assumptions in One Paragraph

Linear regression assumes five things about the data: (1) LINEARITY — the relationship between predictors and response is linear, not curved; (2) INDEPENDENCE — observations are independent of each other; (3) HOMOSCEDASTICITY — the variance of errors is constant across all predictor values; (4) NORMALITY — the errors (residuals) are approximately normally distributed; (5) NO MULTICOLLINEARITY — predictors in multiple regression are not too highly correlated with each other. Checking these assumptions is the diagnostic phase of regression analysis. The tools are:

  • Residuals vs fitted plot: checks linearity and homoscedasticity
  • Q-Q plot of residuals: checks normality
  • Scale-location plot: checks homoscedasticity more rigorously
  • Residuals vs leverage (influence plot): detects influential outliers
  • Variance Inflation Factor (VIF): detects multicollinearity
  • Durbin-Watson test: checks independence of residuals (time series)

If assumptions are violated, regression coefficients and p-values may be biased, inefficient, or misleading. The correct response is to identify which assumption is violated and apply the appropriate fix: transform variables, use robust standard errors, use generalized least squares, or switch to a different model (logistic regression for binary outcomes, quantile regression for skewed data). This content is for educational purposes only and supports statistics student learning.

Key Points

  • Five assumptions: linearity, independence, homoscedasticity, normality, no multicollinearity
  • Residual plots are the primary diagnostic tool
  • Q-Q plot checks normality; scale-location checks homoscedasticity
  • VIF detects multicollinearity (VIF > 5-10 is problematic)
  • Violations require diagnostic response, not just reporting

2. Assumption 1: Linearity

The response variable must have a linear relationship with each predictor. For a curved relationship, linear regression produces biased coefficient estimates and misses the true pattern.

How to check: plot residuals against fitted values. If the relationship is linear, residuals should scatter randomly around zero with no pattern. A curve (like a U shape, inverted U, or systematic trend) indicates non-linearity.

Worked example: suppose you fit a linear regression and the residuals vs fitted plot shows residuals positive for low and high fitted values, negative for middle fitted values. This is a U-shape — your actual relationship is probably quadratic, not linear.

Responses to non-linearity:

1. Transform the response: try log(y) or sqrt(y) if y is positively skewed. These compress the high end and can linearize exponential or power relationships.
2. Transform predictors: try log(x) or x² for relationships that curve. Polynomial regression (adding x² as a predictor) models quadratic relationships directly.
3. Use splines or GAMs: for complex non-linear patterns, generalized additive models (GAMs) fit smooth curves instead of straight lines.
4. Switch to a non-linear model: exponential growth, logistic saturation, power functions — each has its own appropriate model.

Important: transforming variables changes the interpretation of coefficients. In a log(y) ~ x model, a coefficient β means a one-unit increase in x is associated with approximately a (100 × β)% change in y. This approximation holds only for small β; the exact change is (e^β − 1) × 100%, so a coefficient of 0.5 implies roughly a 65% increase, not 50%. Report on the transformed scale clearly.

Key Points

  • Linearity violated if residuals show systematic pattern (U-shape, trend)
  • Check: residuals vs fitted plot should be random around zero
  • Fix: transform response (log, sqrt), transform predictors (log, polynomial), or use nonlinear models
  • Transformed coefficients have transformed interpretations
  • Splines/GAMs fit smooth curves for complex non-linear patterns

3. Assumption 2: Independence

Observations must be independent of each other. Independence is violated when measurements are correlated in time (time series) or clustered in space (nested data, repeated measures). Common violations:

  • Time series data: today's value correlated with yesterday's
  • Panel data: same subject measured multiple times
  • Nested data: students within classrooms, patients within clinics
  • Spatial data: measurements close in space are correlated

How to check: the Durbin-Watson test detects residual autocorrelation in time series. A Durbin-Watson statistic of 2.0 indicates no autocorrelation; below 1.5 suggests positive autocorrelation; above 2.5 suggests negative autocorrelation. More modern tests: Ljung-Box, Breusch-Godfrey. For non-time-series clustering, examine residuals grouped by cluster. If residuals within clusters are systematically positive or negative, cluster correlation exists.

Responses to dependence:

1. Generalized Least Squares (GLS): accounts for a known correlation structure.
2. Random effects / mixed models: allow intercepts or slopes to vary by cluster.
3. Autoregressive models (AR, ARMA): for time series data.
4. Spatial regression: for spatially correlated data.
5. Cluster-robust standard errors: adjust standard errors for cluster structure.

Why it matters: with correlated residuals, standard errors are typically too small, making p-values artificially significant. Reported significance may be spurious — the type I error rate is higher than the nominal α.

Key Points

  • Independence: observations should not be correlated
  • Check: Durbin-Watson for time series; group residuals for clustered data
  • Fix: GLS, mixed models, AR models, cluster-robust standard errors
  • Violation inflates type I error rate (false positive)
  • Time series, panel data, nested data all typically violate independence

4. Assumption 3: Homoscedasticity

The variance of errors must be constant across all levels of predictors. Heteroscedasticity (non-constant variance) produces inefficient coefficient estimates and biased standard errors.

How to check: residuals vs fitted plot. Homoscedasticity looks like a roughly even spread of residuals across the x-axis. Heteroscedasticity looks like a fan or funnel shape — variance increases (or decreases) with fitted values. More rigorous: scale-location plot (square root of absolute residuals vs fitted values). A flat, horizontal reference line indicates constant variance.

Formal tests:

  • Breusch-Pagan test: regresses squared residuals on predictors; a significant result suggests heteroscedasticity
  • White test: a more general version of Breusch-Pagan
  • Goldfeld-Quandt test: splits the data into low and high predictor groups and compares variances

Worked example: suppose you're predicting house prices from square footage. Lower-priced houses show small residual variance (the model fits well). Higher-priced houses show larger residual variance (more variability). The residuals vs fitted plot shows a funnel shape. Breusch-Pagan p < 0.05 confirms heteroscedasticity.

Responses:

1. Log transformation of the response: log(price) often stabilizes variance for price data.
2. Weighted Least Squares (WLS): gives less weight to observations with larger variance.
3. Robust standard errors (Huber-White, HC0-HC3): don't change coefficients but give accurate standard errors.
4. Generalized Linear Models (GLMs): use an appropriate error distribution (gamma, Poisson, etc.).

Consequences: heteroscedasticity doesn't bias the coefficients but makes standard errors wrong. P-values are misleading, confidence intervals are incorrect, and predictions for specific data points may be unreliable.

Key Points

  • Homoscedasticity: constant residual variance across predictor values
  • Check: residuals vs fitted should be evenly scattered (no funnel)
  • Test: Breusch-Pagan, White, Goldfeld-Quandt
  • Fix: log transformation, WLS, robust standard errors
  • Violation: coefficients unbiased but standard errors wrong

5. Assumption 4: Normality of Residuals

Residuals should be approximately normally distributed. This assumption affects the validity of p-values and confidence intervals, especially for small samples.

How to check: Q-Q plot (quantile-quantile). Plot residual quantiles against theoretical normal quantiles. Normal residuals produce points along a straight diagonal line. Deviations from the line indicate non-normality — curves suggest skewness, S-shapes suggest heavy or light tails.

Formal tests:

  • Shapiro-Wilk: tests for normality; p > 0.05 suggests normality is plausible
  • Kolmogorov-Smirnov: compares the residual distribution to a normal distribution
  • Anderson-Darling: similar to K-S but more sensitive to the tails
  • Jarque-Bera: tests skewness and kurtosis

Important: with large samples, normality tests become overpowered — they reject normality even for trivial departures. With small samples, they lack power — they accept normality even for meaningful departures. Visual inspection via Q-Q plot is often more useful than formal tests, especially for n > 100.

Worked example: a Q-Q plot shows residuals following the line for middle quantiles but deviating below it in the lower tail and above it in the upper tail. This is a heavy-tailed distribution — extreme values are more common than a normal distribution would predict.

Responses:

1. Transform the response variable: log, sqrt, Box-Cox, Yeo-Johnson. These reduce skewness.
2. Use robust regression: minimizes the effect of outliers (e.g., M-estimators, MM-estimators).
3. Rely on the central limit theorem for large samples: for n > 100, the sampling distribution of the coefficients is approximately normal regardless of the residual distribution. Mild non-normality is often not a problem.
4. Consider different models: a GLM with an appropriate error distribution, or quantile regression for skewed data.

Practical guidance: for samples larger than 100-200, normality violations are usually not catastrophic due to the CLT. For small samples (n < 30), non-normality can meaningfully affect inference — take it seriously.
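The checks above are available in scipy.stats. This sketch, on invented residual samples, runs Shapiro-Wilk on a normal and a heavy-tailed sample and computes the Q-Q plot quantities without actually drawing the plot:

```python
# Sketch of the normality checks: Shapiro-Wilk on normal vs heavy-tailed
# residual samples, plus the quantile pairs a Q-Q plot would draw.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
normal_resid = rng.normal(0, 1, 80)
heavy_resid = rng.standard_t(df=2, size=80)   # heavy-tailed (t with 2 df)

w_norm, p_norm = stats.shapiro(normal_resid)
w_heavy, p_heavy = stats.shapiro(heavy_resid)
print(f"Shapiro-Wilk p: normal sample {p_norm:.3f}, heavy-tailed {p_heavy:.2g}")

# Q-Q plot data without plotting: theoretical vs sample quantiles
(osm, osr), (slope, intercept, r) = stats.probplot(heavy_resid, dist="norm")
```

Plotting `osr` against `osm` reproduces the Q-Q plot; for the heavy-tailed sample the extreme points fall below the line on the left and above it on the right, exactly the pattern described in the worked example.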

Key Points

  • Normality: residuals approximately normal
  • Check: Q-Q plot should follow straight diagonal line
  • Test: Shapiro-Wilk, K-S, Anderson-Darling
  • Fix: transform response, robust regression, GLM, quantile regression
  • For large samples, CLT provides partial protection

6. Assumption 5: No Multicollinearity

In multiple regression, predictors should not be too highly correlated with each other. High correlation (multicollinearity) makes coefficient estimates unstable and inflates standard errors.

How to check: Variance Inflation Factor (VIF). Calculate VIF for each predictor: VIF_j = 1 / (1 - R²_j), where R²_j is the R-squared from regressing predictor j on all OTHER predictors. Interpretation:

  • VIF = 1: no multicollinearity
  • VIF between 1 and 5: acceptable multicollinearity
  • VIF between 5 and 10: borderline; consider addressing
  • VIF > 10: severe multicollinearity; must address

Alternative: a correlation matrix of predictors. High pairwise correlations (> 0.8 or 0.9) suggest multicollinearity, though VIF captures more than pairwise relationships.

Common causes:

  • Including both x and x² without centering
  • Including multiple measures of the same concept (e.g., both height in inches and height in cm)
  • Including an interaction term and its main effects without centering
  • Subset relationships (e.g., total budget and subset budgets)

Responses to multicollinearity:

1. Remove redundant predictors. If two are almost perfectly correlated, drop one.
2. Combine predictors via PCA or factor analysis.
3. Center predictors (subtract the mean) before computing polynomials or interactions.
4. Ridge regression or LASSO: regularization methods that handle multicollinearity.
5. Accept it if predictions (not coefficients) are the goal: multicollinearity doesn't affect predictive accuracy, only coefficient interpretation.

Consequences: coefficients become unstable — small data changes can flip signs. Standard errors are inflated, making individual coefficients appear non-significant even when the overall model is significant. Interpretation becomes misleading — 'controlling for' other predictors becomes meaningless when predictors are nearly redundant.

Key Points

  • No multicollinearity: predictors not too highly correlated
  • Check: VIF per predictor (VIF > 5-10 is problematic)
  • Fix: remove redundant predictors, center polynomials, use PCA or ridge
  • Violation doesn't affect predictions but makes coefficient interpretation unreliable
  • Pairwise correlations > 0.8 are a warning sign

7. Diagnostic Workflow: What to Do in Practice

Standard diagnostic workflow for a linear regression:

1. Fit the model and extract residuals.
2. Look at the four standard diagnostic plots:
   - Residuals vs Fitted: check linearity and homoscedasticity
   - Q-Q Plot: check normality
   - Scale-Location: check homoscedasticity more rigorously
   - Residuals vs Leverage: identify influential outliers
3. Compute VIF for all predictors. Address any VIF > 5.
4. If residuals show patterns, identify which assumption is violated and apply the appropriate fix.
5. Re-fit after fixes and recheck diagnostics.
6. Report violations, fixes, and final diagnostic results in your analysis writeup.

Modern software:

  • R: plot(lm_model) produces the four standard diagnostic plots automatically. The car package provides VIF; lmtest provides formal tests.
  • Python (statsmodels): model.summary() gives basic diagnostics; scipy.stats.anderson for normality; statsmodels.stats.diagnostic for formal tests.
  • Python (scikit-learn): less automatic — diagnostic plots need to be constructed manually.
  • SAS/SPSS: diagnostic options available through regression menus.

Important caveat: statistical significance of assumption tests is NOT the same as magnitude of violation. For large samples, trivial departures from normality or homoscedasticity can be 'significant' but not practically meaningful. For small samples, meaningful departures may not reach significance. Always combine formal tests with visual diagnostics and judgment.

Key Points

  • Standard workflow: fit model, check residual plots, compute VIF, address issues, re-fit
  • Four diagnostic plots: residuals vs fitted, Q-Q, scale-location, residuals vs leverage
  • Computational tools available in R, Python, SAS, SPSS
  • Large samples: formal tests may be over-sensitive
  • Small samples: formal tests may be under-powered

Key Takeaways

  • Five assumptions: linearity, independence, homoscedasticity, normality, no multicollinearity
  • Residuals vs fitted plot: checks linearity AND homoscedasticity
  • Q-Q plot: checks normality of residuals
  • VIF > 5-10 indicates multicollinearity
  • Shapiro-Wilk: formal normality test
  • Breusch-Pagan: formal homoscedasticity test
  • Durbin-Watson: checks residual independence for time series
  • Fix non-linearity: transform variables or use polynomial/spline models
  • Fix heteroscedasticity: log transform or use robust standard errors
  • For large samples, CLT partially protects against normality violations

Practice Questions

1. Your residuals vs fitted plot shows a clear funnel shape — residuals are small for low fitted values and large for high fitted values. Which assumption is violated and what should you do?
Heteroscedasticity (non-constant variance). Coefficients are still unbiased but standard errors and p-values are wrong. Response options: (1) log transformation of the response variable to stabilize variance, (2) Weighted Least Squares giving less weight to high-variance observations, (3) robust (Huber-White) standard errors to get correct inference without changing the model, (4) generalized linear model (GLM) with appropriate error distribution. Pick based on what the data suggests.
2. You calculate VIF for three predictors and get: Age = 1.5, Income = 8.5, Savings = 9.2. What does this indicate and what should you do?
Income and Savings have high VIF (above 5, close to 10), suggesting substantial multicollinearity between them. They're likely measuring correlated aspects (richer people save more). Options: (1) remove one (pick the less interpretively important), (2) combine via PCA into a single 'wealth' factor, (3) use ridge regression which handles multicollinearity via regularization. Age is fine at 1.5. The overall model may still have good predictive accuracy, but individual coefficient interpretations for Income and Savings are unreliable.
3. Your Q-Q plot shows residuals following the diagonal line in the middle but curving below in the lower tail and above in the upper tail. Describe the distribution shape and its implications.
The residuals are heavy-tailed (leptokurtic) — extreme values (both positive and negative) are more common than a normal distribution predicts. Normal-theory p-values and confidence intervals for the coefficients may be unreliable, and the fit itself may be pulled around by extreme residuals. Responses: (1) check for outliers and consider robust regression, (2) try transformations that compress the tails (log, Box-Cox), (3) for large samples, the CLT provides some protection so inference may still be approximately valid, (4) report the heavy-tailed residuals in your writeup and acknowledge the limitation.
4. You fit a regression on 50 observations and the Shapiro-Wilk p-value for residuals is 0.04. Does this mean your regression is invalid?
Not automatically. A significant Shapiro-Wilk (p < 0.05) indicates the residuals are non-normal at α = 0.05. For n = 50, this is a meaningful sample size but not large enough for CLT to fully protect against non-normality in inference. Response: (1) examine the Q-Q plot visually — is the departure severe (major skew, heavy tails) or mild? (2) Consider transformations of the response variable to improve normality, (3) use bootstrap confidence intervals instead of normal-theory ones, (4) for substantial skew, try quantile regression or GLM.
5. A colleague reports a regression p-value of 0.03 without showing residual diagnostics. Should you trust the result?
Without diagnostics, you can't know. The p-value of 0.03 is meaningful ONLY if the regression assumptions are approximately met. If residuals are heteroscedastic, correlated, or strongly non-normal, the p-value could be too small or too large. Ask for diagnostic plots (residuals vs fitted, Q-Q plot, VIF values) before accepting the inference. This is standard practice in rigorous analysis — reporting results without assumption checks is incomplete.


FAQs

Common questions about this topic

Which of the five assumptions is the most important?

Depends on context. For correlated observations (time series, panel, clustered data), INDEPENDENCE is most important because violations can dramatically inflate type I error. For predictive modeling with continuous predictors, LINEARITY is most important because missing non-linearity means the model is systematically wrong. For interpreting coefficients, NO MULTICOLLINEARITY is critical. For p-value reliability with small samples, NORMALITY matters more than with large samples. All five should be checked, but priorities depend on the question.

Do large samples make assumption checking unnecessary?

Partially. The Central Limit Theorem provides protection against normality violations for large samples (n > 100-200). But the other four assumptions — linearity, independence, homoscedasticity, no multicollinearity — remain critical regardless of sample size. Large samples actually make formal tests MORE sensitive, so you'll detect more violations. Always check residual plots even with large samples.

What is the difference between a statistically significant violation and a practically significant one?

Statistical significance (p < 0.05 on Shapiro-Wilk, Breusch-Pagan) just means the violation is detectable given the sample size. Practical significance is whether the violation meaningfully affects your conclusions. For large samples, tiny violations become statistically significant but don't change inferences. For small samples, real violations may not reach significance but still affect inference. Look at the residual plots and ask: does this departure look severe enough to worry about?

Does transforming the response variable change how coefficients are interpreted?

Yes. A log(y) ~ x regression gives coefficients interpreted as 'a one-unit increase in x is associated with an approximately (100 × β)% change in y' for small β. This is a different interpretation than raw y ~ x. When reporting transformed regressions, always clarify the interpretation and provide examples converting back to the original scale when useful. Some analyses prefer to bootstrap predictions on the original scale rather than report transformed coefficients.
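The exact back-conversion mentioned above is a one-liner: the percent change in y implied by a log(y) ~ x coefficient β is (e^β − 1) × 100, which the 100 × β rule of thumb approximates only for small β. A sketch (function name is our own):

```python
# Exact percent change implied by a coefficient in a log(y) ~ x model.
import math

def pct_change_from_log_coef(beta):
    """Exact % change in y per one-unit increase in x, for log(y) ~ x."""
    return (math.exp(beta) - 1) * 100

print(pct_change_from_log_coef(0.05))  # ~5.1%: close to the 100*beta rule
print(pct_change_from_log_coef(0.5))   # ~64.9%: the 50% rule of thumb is off
```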

What should I do when several assumptions are violated at once?

Common in practice. Prioritize: fix the most severe violation first, re-fit, re-check. A log transformation of the response often addresses multiple issues simultaneously (reduces skewness, stabilizes variance, may linearize the relationship). For particularly messy data, consider moving to a different model family — GLM with appropriate error distribution, mixed models for nested structure, quantile regression for skewed responses.

Can StatsIQ help with regression diagnostics?

Yes. Upload or describe your regression output and StatsIQ generates the diagnostic plot interpretation, identifies which assumptions are violated based on the visual pattern, suggests appropriate fixes (transformations, robust standard errors, alternative models), and walks through the corrective workflow. Works for R, Python, SPSS, and SAS regression outputs. This content is for educational purposes only and does not constitute statistical advice.
