Coefficient of Determination (R²) Formula: 1 - SS_res / SS_tot Explained With Worked Examples
A complete guide to the coefficient of determination (R²) — covering the formula 1 - SS_res / SS_tot, how to compute the sum of squared residuals and total sum of squares, what R² means in regression analysis, and how to interpret values from 0 to 1 with worked examples.
What You'll Learn
- ✓Calculate R² using the formula 1 - SS_res / SS_tot from a dataset of observed and predicted values
- ✓Explain what SS_res (sum of squared residuals) and SS_tot (total sum of squares) represent geometrically
- ✓Interpret R² values in the context of regression model quality
- ✓Distinguish R² from adjusted R² and from the simple correlation coefficient r
1. The Direct Answer: R² = 1 - SS_res / SS_tot Measures Variance Explained
The coefficient of determination, denoted R² (R squared), is a statistic that measures the proportion of variance in the dependent variable that is explained by the regression model. The most common formula is:

R² = 1 - (SS_res / SS_tot)

where SS_res (the sum of squared residuals, also called the sum of squared errors or SSE) is the sum of squared differences between the observed y values and the predicted y values from the model, and SS_tot (the total sum of squares) is the sum of squared differences between the observed y values and the mean of y.

Intuitively: SS_tot measures the total variability in y around its mean. SS_res measures how much variability remains AFTER fitting the model. The ratio SS_res / SS_tot is the proportion of variability that the model FAILED to explain. So 1 minus that ratio is the proportion of variability the model DID explain — which is R².

For ordinary least squares regression with an intercept, R² ranges from 0 to 1 (or 0% to 100%). R² = 0 means the model explains none of the variance (the model is no better than always predicting the mean). R² = 1 means the model explains all of the variance (the predictions perfectly match the observed data). Real-world R² values typically fall between 0.2 and 0.9 depending on the field and the strength of the relationship.

An alternative formula gives the same answer for simple linear regression: R² = (covariance of x and y / (standard deviation of x × standard deviation of y))² = r², where r is the Pearson correlation coefficient. For simple linear regression with one predictor, R² is literally the square of the correlation coefficient — which is where the name R squared comes from.

Snap a photo of any R² problem and StatsIQ walks through the SS_res and SS_tot computation, applies the formula, and explains the interpretation in the context of the data.
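The formula is short enough to sketch directly. Here is a minimal plain-Python version (the function and variable names are illustrative, not from any particular library):

```python
def r_squared(observed, predicted):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(observed) / len(observed)
    # SS_res: squared differences between observed and predicted values
    ss_res = sum((y - yhat) ** 2 for y, yhat in zip(observed, predicted))
    # SS_tot: squared differences between observed values and their mean
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    return 1 - ss_res / ss_tot
```

For example, `r_squared([2, 4, 5, 4, 5], [2.8, 3.4, 4.0, 4.6, 5.2])` returns 0.6 (up to floating-point precision) — the same dataset worked through step by step in the next section.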
Key Points
- •R² = 1 - SS_res / SS_tot. Measures proportion of variance explained by the regression model.
- •SS_res = sum of (observed - predicted)². SS_tot = sum of (observed - mean of y)².
- •R² range: 0 to 1. R² = 0 means model explains nothing. R² = 1 means perfect fit.
- •For simple linear regression with one predictor, R² = r² where r is the Pearson correlation.
2. Computing SS_res, SS_tot, and R² Step by Step
Walk through the calculation with a small dataset. Suppose you have 5 observations of (x, y): (1, 2), (2, 4), (3, 5), (4, 4), (5, 5).

Step 1: Find the mean of y. ȳ = (2 + 4 + 5 + 4 + 5) / 5 = 20 / 5 = 4.

Step 2: Calculate SS_tot = Σ(y_i - ȳ)². (2-4)² + (4-4)² + (5-4)² + (4-4)² + (5-4)² = 4 + 0 + 1 + 0 + 1 = 6. SS_tot = 6.

Step 3: Fit a linear regression line using the standard formula: slope b = Σ((x_i - x̄)(y_i - ȳ)) / Σ((x_i - x̄)²). With x̄ = (1+2+3+4+5)/5 = 3, the numerator is (1-3)(2-4) + (2-3)(4-4) + (3-3)(5-4) + (4-3)(4-4) + (5-3)(5-4) = 4 + 0 + 0 + 0 + 2 = 6, and the denominator is (1-3)² + (2-3)² + (3-3)² + (4-3)² + (5-3)² = 4 + 1 + 0 + 1 + 4 = 10. So b = 6/10 = 0.6, and the intercept is a = ȳ - b × x̄ = 4 - 0.6 × 3 = 4 - 1.8 = 2.2. Regression equation: ŷ = 2.2 + 0.6x.

Step 4: Calculate the predicted value ŷ_i for each x.
x=1: ŷ = 2.2 + 0.6(1) = 2.8
x=2: ŷ = 2.2 + 0.6(2) = 3.4
x=3: ŷ = 2.2 + 0.6(3) = 4.0
x=4: ŷ = 2.2 + 0.6(4) = 4.6
x=5: ŷ = 2.2 + 0.6(5) = 5.2

Step 5: Calculate SS_res = Σ(y_i - ŷ_i)². (2 - 2.8)² + (4 - 3.4)² + (5 - 4.0)² + (4 - 4.6)² + (5 - 5.2)² = 0.64 + 0.36 + 1.0 + 0.36 + 0.04 = 2.4. SS_res = 2.4.

Step 6: Apply the R² formula. R² = 1 - SS_res / SS_tot = 1 - 2.4 / 6 = 1 - 0.4 = 0.6.

Interpretation: the regression model explains 60% of the variance in y. The remaining 40% is unexplained — due to random variation, measurement error, or factors not included in the model.

Verification using the correlation approach (valid since this is simple linear regression): r = Σ((x_i - x̄)(y_i - ȳ)) / sqrt(Σ(x_i - x̄)² × Σ(y_i - ȳ)²) = 6 / sqrt(10 × 6) = 6 / sqrt(60) = 6 / 7.746 ≈ 0.7746. This equals covariance(x,y) / (sd_x × sd_y), because the 1/(n-1) factors in the covariance and standard deviations cancel. Then r² = 0.6. Same answer.

StatsIQ handles this calculation automatically — snap any dataset and it computes the regression line, predicted values, residuals, SS_res, SS_tot, and R² with the full work shown.
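The six steps above can be reproduced in a few lines of plain Python (no libraries; variable names are illustrative):

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Steps 1-2: mean of y and total sum of squares
y_bar = sum(y) / len(y)                                    # 4.0
ss_tot = sum((yi - y_bar) ** 2 for yi in y)                # 6.0

# Step 3: least-squares slope and intercept
x_bar = sum(x) / len(x)                                    # 3.0
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)                   # 6 / 10 = 0.6
a = y_bar - b * x_bar                                      # ~2.2

# Steps 4-5: predictions and residual sum of squares
y_hat = [a + b * xi for xi in x]                           # ~[2.8, 3.4, 4.0, 4.6, 5.2]
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # ~2.4

# Step 6: the coefficient of determination
r2 = 1 - ss_res / ss_tot                                   # ~0.6
```

Running this confirms the hand calculation: SS_tot = 6, SS_res ≈ 2.4, R² ≈ 0.6 (the "≈" only reflects floating-point rounding).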
Key Points
- •Find ȳ, calculate SS_tot from deviations of y around its mean.
- •Fit the regression line to get slope and intercept.
- •Compute predicted values ŷ for each x using the regression equation.
- •SS_res = sum of squared differences between observed and predicted. R² = 1 - SS_res/SS_tot.
3. Interpreting R² Values in Context
The numerical value of R² alone does not tell you whether a model is good or bad — context matters enormously. The same R² value can be excellent in one field and poor in another.

R² values by domain (rough guidelines):
- Physics, engineering, lab measurements: R² > 0.95 is typical for well-controlled experiments. Anything below 0.9 suggests measurement noise or missing variables.
- Chemistry experiments: R² > 0.99 for calibration curves (very tight relationships expected).
- Biology and medical research: R² of 0.5-0.8 is good. Biological variability is high.
- Social sciences (psychology, sociology, education): R² of 0.3-0.5 is good. Human behavior is noisy.
- Economics, finance: R² of 0.1-0.3 is common for many models. Markets are extremely noisy.
- Marketing, ad attribution: R² of 0.2-0.5 is typical. Attribution is hard.

A model with R² = 0.4 in physics is a disaster. The same model in marketing is reasonably good. Always interpret R² relative to what is achievable in your field.

What R² does NOT tell you: (1) whether the model is causally correct (a high R² with the wrong variables is meaningless), (2) whether the model will generalize to new data (high R² in training data may reflect overfitting), (3) whether the relationship is linear vs nonlinear (you can have low R² because a linear model is the wrong functional form, not because there is no relationship), (4) whether individual predictions are accurate (high R² means the model fits well on average, but specific predictions can still be wildly off), and (5) whether the sample size is adequate (a high R² with n = 4 is essentially meaningless).

The most common mistake in R² interpretation: chasing high R² by adding more variables. R² ALWAYS increases (or at worst stays the same) when you add predictors to a model, even if those predictors are pure noise. This is why we use ADJUSTED R² for multiple regression with many predictors. Adjusted R² penalizes model complexity and only increases when added predictors meaningfully improve the fit.
For publication and serious analysis, report R² alongside: the sample size, the residual standard error, F-test significance, and ideally a residual plot to check assumptions. R² alone is not sufficient — it is one metric among many for assessing model quality. StatsIQ explains R² interpretation in the context of your specific data and field, identifies common interpretation errors, and recommends when to use adjusted R² or alternative goodness-of-fit metrics.
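The "R² can only go up" effect is easy to demonstrate with a small simulation. This sketch uses NumPy; the synthetic data, seed, and helper-function name are all invented for illustration. The true model depends on one predictor only, yet adding five pure-noise columns still raises raw R²:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
y = 2 + 0.5 * x + rng.normal(size=n)      # true relationship uses only x

def fit_r2(X, y):
    """OLS R^2 for a design matrix X (intercept column included in X)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

X1 = np.column_stack([np.ones(n), x])
r2_small = fit_r2(X1, y)

# Add five pure-noise predictors: least squares can always ignore them
# or exploit chance correlations, so SS_res never goes up.
noise = rng.normal(size=(n, 5))
X2 = np.column_stack([X1, noise])
r2_big = fit_r2(X2, y)

assert r2_big >= r2_small                 # holds for any dataset
```

This is exactly the overfitting trap described above: the noise columns improve the in-sample fit by chance alone, which is why adjusted R² (or out-of-sample validation) is the fairer comparison.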
Key Points
- •R² interpretation depends heavily on the field. R² = 0.4 is bad in physics, good in marketing.
- •High R² does NOT mean: causally correct, will generalize to new data, the relationship is truly linear, or individual predictions are accurate.
- •R² never decreases when you add more predictors. Use adjusted R² to penalize complexity.
- •Always report R² with sample size, residual SE, F-test, and ideally a residual plot for context.
4. R² vs Adjusted R² vs Correlation Coefficient
R² has close cousins that are easy to confuse but mean different things. Knowing the distinctions is essential for any regression analysis.

**R² (coefficient of determination)**: the basic version. Measures the proportion of variance explained. Formula: 1 - SS_res / SS_tot. Range 0 to 1. Always increases (or stays the same) when you add predictors to the model. Use for: simple linear regression with one predictor, or descriptive comparison of explained variance.

**Adjusted R² (R² adj or R̄²)**: penalizes model complexity. Formula: 1 - [(1 - R²)(n - 1) / (n - k - 1)], where n is the sample size and k is the number of predictors. Adjusted R² can DECREASE when you add a predictor that does not improve the model meaningfully. Use for: comparing models with different numbers of predictors, multiple regression, and any setting where you want to penalize unnecessary complexity.

Example of why adjusted R² matters: a model with 5 predictors might have R² = 0.55. Adding 5 more predictors might bump R² to 0.60 — but if those new predictors are random noise, adjusted R² might go DOWN from 0.50 to 0.45. Adjusted R² catches the overfitting that raw R² hides.

**Correlation coefficient (r)**: measures the strength and direction of a linear relationship between two variables. Range -1 to +1. Negative values indicate inverse relationships. Formula: covariance(x, y) / (sd_x × sd_y). For simple linear regression with one predictor, r² = R². In the two-variable case the correlation coefficient is more informative than R² because it tells you the direction (positive or negative) as well as the strength.

**Multiple R**: the multiple correlation coefficient. For multiple regression with several predictors, multiple R is the correlation between the observed y values and the predicted y values. Multiple R squared = R² (the same R² we have been discussing). For simple linear regression with one predictor, multiple R = |r| (the absolute value of the simple correlation).
Relationships:
- Simple linear regression: R² = r²
- Multiple regression: R² = (multiple R)²
- Adjusted R² ≤ R² whenever k > 0 (strictly less unless R² = 1)
- The adjusted R² penalty grows with more predictors and shrinks with larger samples

Which to report: for descriptive analysis with one predictor, report R² (or r). For regression with multiple predictors, report adjusted R² as the primary measure and R² as a secondary measure. For very large samples (n > 1,000), the difference between R² and adjusted R² is negligible. For small samples (n < 50), the difference can be substantial and using adjusted R² is critical.

Exam-style question variants often test whether students know that R² never decreases with more predictors, that adjusted R² is the right metric for multi-predictor models, and that R² = r² for simple linear regression. Knowing these distinctions cold prevents common test mistakes.

StatsIQ calculates R², adjusted R², and the simple correlation coefficient for any regression problem and explains the differences between them with the specific data context.
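The adjusted R² formula is a one-liner in any language. A minimal Python sketch (the sample size n = 30 in the usage example is hypothetical, since the overfitting example in the text does not specify n):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where n = sample size and k = number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical n = 30: five real predictors vs. five more noise predictors.
adj_small = adjusted_r2(0.55, 30, 5)    # ~0.456
adj_big = adjusted_r2(0.60, 30, 10)     # ~0.389
```

With these numbers, raw R² rose from 0.55 to 0.60 while adjusted R² fell from about 0.456 to about 0.389 — the penalty term (n - 1)/(n - k - 1) grows with k, so weak predictors drag the adjusted value down.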
Key Points
- •R²: basic, non-negative for OLS with an intercept, never decreases with more predictors. Use for one-predictor models.
- •Adjusted R²: penalizes complexity, can decrease with bad predictors. Use for multi-predictor comparison.
- •r (correlation coefficient): -1 to +1. r² = R² for simple linear regression. Tells you direction.
- •Multiple R: correlation between observed and predicted y in multiple regression. (Multiple R)² = R².
Key Takeaways
- ★R² = 1 - SS_res / SS_tot. Proportion of variance in y explained by the regression model.
- ★SS_res = sum of squared residuals. SS_tot = sum of squared deviations of y from its mean.
- ★For simple linear regression with one predictor, R² = r² (square of the correlation coefficient).
- ★R² never decreases when adding predictors. Use adjusted R² to compare multi-predictor models.
- ★R² interpretation depends on field. R² = 0.4 is bad in physics, good in marketing.
Practice Questions
1. A regression model has SS_tot = 250 and SS_res = 75. Calculate R² and interpret the result.
2. A researcher fits a regression with 3 predictors and gets R² = 0.55. They add 5 more predictors and R² jumps to 0.62. They claim the new model is better. What is wrong with this reasoning?
FAQs
Common questions about this topic
**Can R² be negative?**
For ordinary least squares (OLS) regression with an intercept, R² is always between 0 and 1 — it cannot be negative. However, in some special cases (regression without an intercept, certain non-OLS methods, or when comparing a model to a baseline that is not the mean), R² can technically be negative. A negative R² means the model fits WORSE than just predicting the mean of y. In standard linear regression problems, you should not see a negative R² — if you do, check your formula or your model setup.
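To make the negative case concrete, here is a small sketch reusing the dataset from Section 2 with a deliberately terrible (hypothetical) model that always predicts 10 — far above every observed value:

```python
y = [2, 4, 5, 4, 5]
bad_pred = [10] * 5          # hypothetical model: always predict 10

mean_y = sum(y) / len(y)     # 4.0
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, bad_pred))   # 186
ss_tot = sum((yi - mean_y) ** 2 for yi in y)                  # 6.0

r2 = 1 - ss_res / ss_tot     # 1 - 31 = -30.0
```

Here SS_res (186) dwarfs SS_tot (6), so R² = -30: the formula still computes, and the large negative value is exactly the signal that this model is far worse than simply predicting the mean.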
**Can StatsIQ solve R² problems from a photo?**
Yes. Snap a photo of any regression problem with raw data and StatsIQ calculates the regression line, predicted values, SS_res, SS_tot, R², adjusted R², and the correlation coefficient. It shows the work step by step and explains the interpretation in the context of your specific data and field.