Below are the results of two different linear regressions. The first has an N of only 10, while the second has an N of 43. The first shows a very high overall association between the dependent variable and the independent variables (Adj R2 = 0.948, P = 0.003). In the second, only one variable has a significant association with the dependent variable, and the overall association is lower (Adj R2 = 0.436, P < 0.001).

[Regression output for the 1st case]

[Regression output for the 2nd case]

I am trying to understand some of the reasons for these two sets of results. The first seems spurious due to the small N. I am still having trouble understanding how the second could have an adjusted R2 as high as it does given that only one variable has a significant association with the dependent variable. What kinds of things might be going on here?

  • If the dependent variable is different you cannot compare the R2. (Commented Jul 30, 2016 at 19:09)
  • The dependent variable is the same. The study area is different though. (Commented Jul 30, 2016 at 21:07)
  • R^2 for the second model is not that high imho... it only "explains" (or better, covers) ~50% of the variance, right? (Commented Aug 1, 2016 at 10:42)

1 Answer

What kinds of things might be going on here?

There could be all kinds of things going on here. Without details of how the study was conducted, and what the research question is, all we can do is offer a few ideas.

  1. You are right to point out the small sample size. With only 10 observations and, apparently, 8 estimated coefficients (including the intercept), this model is likely to be over-fitted. In such a case you would expect to find a large $R^2$, which is indeed the case (a short simulation illustrating this appears after the Python example below).

  2. For the second model it seems that the concern is that $R^2$ is high. Well, "high" in relation to what? 0.44 is not particularly high. Moreover, there is no reason why you can't get "high" $R^2$ values when only one variable in the model is "significant". Here is a simple simulation in R and Python to demonstrate:

The simulation uses a 7 x 7 correlation matrix to create 7 covariates, as per the OP, only one of which is used to generate the response vector, so the regression results should show a "significant" p-value for that one variable and "non-significant" ones for the other 6. First, in R:

library(MASS) # needed for the mvrnorm function

N <- 43

# Correlation matrix for the covariates. Not strictly necessary, but more realistic than having them independent:
sigma <- matrix(c(1, 0.2, 0.3, 0.1, 0.4, 0.2, 0.3,
                  0.2, 1, 0.5, 0.2, 0.3, 0.1, 0.2,
                  0.3, 0.5, 1, 0.3, 0.4, 0.2, 0.1,
                  0.1, 0.2, 0.3, 1, 0.3, 0.5, 0.2,
                  0.4, 0.3, 0.4, 0.3, 1, 0.2, 0.1,
                  0.2, 0.1, 0.2, 0.5, 0.2, 1, 0.4,
                  0.3, 0.2, 0.1, 0.2, 0.1, 0.4, 1), 
                nrow = 7, ncol = 7)

# Means
mu <- rep(0, 7)

# Now generate the dataset
set.seed(15)  
data <- as.data.frame(mvrnorm(n = N, mu = mu, Sigma = sigma))

data$Y <- 10 +            # intercept
          data$V1 +       # beta = 1
          rnorm(N, 0, 1)  # noise

lm(Y ~ ., data = data) |> summary()

which yields:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.08403    0.17363  58.078  < 2e-16 ***
V1           0.99665    0.16281   6.122 5.34e-07 ***
V2           0.08413    0.18381   0.458    0.650    
V3          -0.20236    0.18788  -1.077    0.289    
V4           0.17449    0.19707   0.885    0.382    
V5           0.23005    0.16789   1.370    0.179    
V6          -0.23291    0.18362  -1.268    0.213    
V7          -0.19477    0.19615  -0.993    0.328    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9352 on 35 degrees of freedom
Multiple R-squared:  0.606, Adjusted R-squared:  0.5272 
F-statistic: 7.691 on 7 and 35 DF,  p-value: 1.271e-05

As you can see, we obtain a similar adjusted $R^2$ of 0.53 and a "significant" coefficient only for V1, much as in the 2nd model in the question.

For those who prefer Python, here is some equivalent code:

import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(15)

N = 43

# Correlation matrix for the covariates. Not strictly necessary, but more realistic than having them independent
sigma = np.array([
    [1, 0.2, 0.3, 0.1, 0.4, 0.2, 0.3],
    [0.2, 1, 0.5, 0.2, 0.3, 0.1, 0.2],
    [0.3, 0.5, 1, 0.3, 0.4, 0.2, 0.1],
    [0.1, 0.2, 0.3, 1, 0.3, 0.5, 0.2],
    [0.4, 0.3, 0.4, 0.3, 1, 0.2, 0.1],
    [0.2, 0.1, 0.2, 0.5, 0.2, 1, 0.4],
    [0.3, 0.2, 0.1, 0.2, 0.1, 0.4, 1]
])

# Now generate the dataset
data = np.random.multivariate_normal(mean=np.zeros(7), cov=sigma, size=N)
df = pd.DataFrame(data, columns=[f'V{i+1}' for i in range(7)])

# Compute the response variable Y
df['Y'] = 10 + df['V1'] + np.random.normal(0, 1, N)

# Fit the linear model
X = sm.add_constant(df.drop(columns=['Y']))  # Add intercept
model = sm.OLS(df['Y'], X).fit()

# Retrieve adjusted R2, p-values, and coefficients
adjusted_r_squared = model.rsquared_adj
p_values = model.pvalues
coefficients = model.params

# Combine coefficients and p-values into a single DataFrame for display
results_df = pd.DataFrame({
    'Coefficient': coefficients,
    'P-value': p_values
})

# Display results
print("Adjusted R²:", adjusted_r_squared)
print("\nCoefficients and P-values:\n", results_df)

which gives us:

Adjusted R²: 0.47987115035473205

Coefficients and P-values:
        Coefficient       P-value
const     9.852533  2.418140e-36
V1        1.093086  4.513510e-07
V2       -0.091330  6.783820e-01
V3       -0.031855  8.752859e-01
V4        0.081758  6.878198e-01
V5        0.045772  8.283457e-01
V6       -0.012240  9.509285e-01
V7       -0.206075  2.707206e-01

which is quite similar to what we obtained in R; it is not identical because R and NumPy do not produce the same random draws from a given seed, so the simulated datasets differ.
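
Coming back to point 1 above: here is a rough sketch, reusing the sigma correlation matrix from the R code above, of why a very high $R^2$ from 10 observations and 7 covariates is not reassuring. The response below is pure noise, unrelated to any covariate, yet the full model still tends to produce a large $R^2$; and with only 2 residual degrees of freedom, the adjusted $R^2$ is so unstable that it can come out close to 1 by chance:

# Simulate many datasets of size 10 in which Y is pure noise (no covariate
# has any true effect) and record R^2 and adjusted R^2 from the full model
set.seed(15)
n_small <- 10

r2 <- replicate(1000, {
    d <- as.data.frame(mvrnorm(n = n_small, mu = rep(0, 7), Sigma = sigma))
    d$Y <- rnorm(n_small, 0, 1)
    s <- summary(lm(Y ~ ., data = d))
    c(R2 = s$r.squared, adjR2 = s$adj.r.squared)
})

rowMeans(r2)                                # average R^2 is large even though Y is noise
quantile(r2["adjR2", ], c(0.5, 0.9, 0.99))  # upper tail of adjusted R^2 gets close to 1

So a large (adjusted) $R^2$ from such a small, heavily parameterised model is weak evidence on its own.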

Other things that could be going on include:

  • unmeasured confounding
  • multicollinearity
  • mediation
  • over-adjustment

to name a few.
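
To illustrate just one of these, multicollinearity: two highly correlated covariates can jointly account for much of the variance in the response while their individual standard errors are inflated, so that few (or none) of them appear "significant" even though the overall $R^2$ is high. A minimal sketch:

# Two covariates that are near-copies of each other (correlation around 0.98)
set.seed(15)
n  <- 43
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, 0, 0.2)

# The response depends on both, and together they explain most of its variance
y <- 10 + x1 + x2 + rnorm(n, 0, 1)

# High R^2, but the inflated standard errors mean that x1 and x2 can easily
# fail to reach individual "significance"
summary(lm(y ~ x1 + x2))

Any of these mechanisms, alone or in combination, could be contributing to the pattern of coefficients and $R^2$ values you are seeing.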

