
I created two models using R's lm() function. For the first model, I built the design matrix for my predictor variables myself and fed that into lm().

# build the design matrix explicitly, then regress on it with no formula intercept
copy <- data.frame(mtcars)
reduced_copy <- model.matrix(~ cyl + hp, data = copy)
mpg <- copy$mpg

copy_model <- lm(mpg ~ 0 + reduced_copy)
print(summary(copy_model))

The summary of the model is below:

[model summary posted as an image]

For my second model, I converted the cyl variable into a factor and then created a model from the data.

# convert cyl to a factor and let lm() build the design matrix itself
copy2 <- data.frame(mtcars)
copy2$cyl <- as.factor(copy2$cyl)

copy2_model <- lm(mpg ~ cyl + hp, data = copy2)
print(summary(copy2_model))

And the summary of that model is below:

[model summary posted as an image]

The intercept and regression coefficients are the same for both models. What I do not understand is why the R-squared values are so different. From my understanding, lm() creates a design matrix under the hood, so I expected the two models to give the same, or at least very similar, results.

Comments:
  • Please don't upload code, error messages, results or data as images. Commented Jun 6 at 14:12
  • The code you provide does not create the output provided on my machine: specifically, copy2_model has a cyl4 level, and copy_model shows cyl as continuous rather than a factor. It would be good to check the output. Commented Jun 6 at 14:46

1 Answer

summary.lm computes R^2 differently for intercept and no-intercept models. The way you set up your first model misleads R into thinking there is no intercept.
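A quick way to see this (a sketch using numeric cyl in both fits, unlike the factor version in the question): model.matrix() already includes an "(Intercept)" column, so writing ~ 0 + reduced_copy fits the same coefficients, but the model's terms object records intercept = 0, and that flag is what summary.lm() checks.

```r
# Both fits span the same column space and give identical coefficients,
# but only the second one records an intercept term.
X  <- model.matrix(~ cyl + hp, data = mtcars)  # includes an "(Intercept)" column
m1 <- lm(mtcars$mpg ~ 0 + X)                   # intercept suppressed in the formula
m2 <- lm(mpg ~ cyl + hp, data = mtcars)

colnames(X)                                    # "(Intercept)" "cyl" "hp"
attr(terms(m1), "intercept")                   # 0 -- summary.lm() treats this as no-intercept
attr(terms(m2), "intercept")                   # 1
all.equal(unname(coef(m1)), unname(coef(m2)))  # TRUE: same fitted model
```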

From https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-does-summary_0028_0029-report-strange-results-for-the-R_005e2-estimate-when-I-fit-a-linear-model-with-no-intercept_003f

As described in ?summary.lm, when the intercept is zero (e.g., from y ~ x - 1 or y ~ x + 0), summary.lm() uses the formula R^2 = 1 - Sum(R[i]^2) / Sum(y[i]^2), where R[i] is the i-th residual, which is different from the usual R^2 = 1 - Sum(R[i]^2) / Sum((y[i] - mean(y))^2). There are several reasons for this:

  • Otherwise the R^2 could be negative (because the model with zero intercept can fit worse than the constant-mean model it is implicitly compared to).
  • If you set the slope to zero in the model with a line through the origin you get fitted values y* = 0, so the implicit null model being compared against is y = 0, not y = mean(y).
  • The model with constant, non-zero mean is not nested in the model with a line through the origin.

All these come down to saying that if you know a priori that E[Y]=0 when x=0 then the ‘null’ model that you should compare to the fitted line, the model where x doesn’t explain any of the variance, is the model where E[Y]=0 everywhere. (If you don’t know a priori that E[Y]=0 when x=0, then you probably shouldn’t be fitting a line through the origin.)
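You can reproduce both reported R^2 values by hand from the two formulas above (a sketch using mtcars with numeric cyl in both fits; the residuals are identical for the two models because they span the same column space):

```r
# Compute the centered and uncentered R^2 by hand and match them
# against what summary.lm() reports for each model.
X  <- model.matrix(~ cyl + hp, data = mtcars)
y  <- mtcars$mpg
m1 <- lm(y ~ 0 + X)                 # "no-intercept" as far as summary.lm() is concerned
m2 <- lm(mpg ~ cyl + hp, data = mtcars)

rss <- sum(residuals(m2)^2)         # same residual sum of squares for both fits

r2_centered   <- 1 - rss / sum((y - mean(y))^2)  # usual R^2
r2_uncentered <- 1 - rss / sum(y^2)              # no-intercept R^2

all.equal(r2_centered,   summary(m2)$r.squared)  # TRUE
all.equal(r2_uncentered, summary(m1)$r.squared)  # TRUE
```

The uncentered denominator sum(y^2) is much larger than the centered one whenever mean(y) is far from zero, which is why the no-intercept R^2 looks inflated.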

See also https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-why-are-r2-and-f-so-large-for-models-without-a-constant/
