
I created two models using R's lm() function. For the first model, I built the design matrix for my predictor variables myself and fed that into lm().

# build the design matrix explicitly, then regress on it with no formula intercept
copy <- data.frame(mtcars)
reduced_copy <- model.matrix(~ cyl + hp, data = copy)
mpg <- copy$mpg

copy_model <- lm(mpg ~ 0 + reduced_copy)
print(summary(copy_model))

The summary of the model is below:

[model summary posted as an image]

For my second model, I converted the cyl variable into a factor and then created a model from the data.

# convert cyl to a factor and let lm() build the design matrix itself
copy2 <- data.frame(mtcars)
copy2$cyl <- as.factor(copy2$cyl)

copy2_model <- lm(mpg ~ cyl + hp, data = copy2)
print(summary(copy2_model))

And the summary of that model is below:

[model summary posted as an image]

The intercept and regression coefficients are the same for both models. What I do not understand is why the R-squared values are so different. From my understanding, lm() creates a design matrix under the hood, so I expected the two models to give the same, or at least very similar, results.

Comments:
  • Please don't upload code, error messages, results or data as images. Commented Jun 6 at 14:12
  • The code you provide does not create the output provided on my machine: specifically, copy2_model has a cyl4 level, and copy_model shows cyl as continuous rather than a factor. It would be good to check the output. Commented Jun 6 at 14:46

1 Answer

summary.lm computes R^2 differently for intercept and no-intercept models. The way you set up your first model misleads R into thinking there is no intercept.
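A quick way to see this (a sketch using numeric cyl in both fits, unlike the factor version in the question): model.matrix() already includes an "(Intercept)" column, so writing ~ 0 + reduced_copy fits the same coefficients, but the model's terms object records intercept = 0, and that flag is what summary.lm() checks.

```r
# Both fits span the same column space and give identical coefficients,
# but only the second one records an intercept term.
X  <- model.matrix(~ cyl + hp, data = mtcars)  # includes an "(Intercept)" column
m1 <- lm(mtcars$mpg ~ 0 + X)                   # intercept suppressed in the formula
m2 <- lm(mpg ~ cyl + hp, data = mtcars)

colnames(X)                                    # "(Intercept)" "cyl" "hp"
attr(terms(m1), "intercept")                   # 0 -- summary.lm() treats this as no-intercept
attr(terms(m2), "intercept")                   # 1
all.equal(unname(coef(m1)), unname(coef(m2)))  # TRUE: same fitted model
```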

From https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-does-summary_0028_0029-report-strange-results-for-the-R_005e2-estimate-when-I-fit-a-linear-model-with-no-intercept_003f

As described in ?summary.lm, when the intercept is zero (e.g., from y ~ x - 1 or y ~ x + 0), summary.lm() uses the formula R^2 = 1 - Sum(R[i]^2) / Sum(y[i]^2), where R[i] is the i-th residual, which is different from the usual R^2 = 1 - Sum(R[i]^2) / Sum((y[i] - mean(y))^2). There are several reasons for this:

  • Otherwise the R^2 could be negative (because the model with zero intercept can fit worse than the constant-mean model it is implicitly compared to).
  • If you set the slope to zero in the model with a line through the origin you get fitted values y* = 0, so the implicit null model being compared against is y = 0, not y = mean(y).
  • The model with constant, non-zero mean is not nested in the model with a line through the origin.

All these come down to saying that if you know a priori that E[Y]=0 when x=0 then the ‘null’ model that you should compare to the fitted line, the model where x doesn’t explain any of the variance, is the model where E[Y]=0 everywhere. (If you don’t know a priori that E[Y]=0 when x=0, then you probably shouldn’t be fitting a line through the origin.)
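You can reproduce both reported R^2 values by hand from the two formulas above (a sketch using mtcars with numeric cyl in both fits; the residuals are identical for the two models because they span the same column space):

```r
# Compute the centered and uncentered R^2 by hand and match them
# against what summary.lm() reports for each model.
X  <- model.matrix(~ cyl + hp, data = mtcars)
y  <- mtcars$mpg
m1 <- lm(y ~ 0 + X)                 # "no-intercept" as far as summary.lm() is concerned
m2 <- lm(mpg ~ cyl + hp, data = mtcars)

rss <- sum(residuals(m2)^2)         # same residual sum of squares for both fits

r2_centered   <- 1 - rss / sum((y - mean(y))^2)  # usual R^2
r2_uncentered <- 1 - rss / sum(y^2)              # no-intercept R^2

all.equal(r2_centered,   summary(m2)$r.squared)  # TRUE
all.equal(r2_uncentered, summary(m1)$r.squared)  # TRUE
```

The uncentered denominator sum(y^2) is much larger than the centered one whenever mean(y) is far from zero, which is why the no-intercept R^2 looks inflated.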

See also https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-why-are-r2-and-f-so-large-for-models-without-a-constant/
