5

I'm trying to run a simple single linear regression over a large number of variables, grouped according to another variable. Using the mtcars dataset as an example, I'd like to run a separate linear regression between mpg and each other variable (mpg ~ disp, mpg ~ hp, etc.), grouped by another variable (for example, cyl).

Running lm over each variable independently can easily be done using purrr::map (modified from this great tutorial - https://sebastiansauer.github.io/EDIT-multiple_lm_purrr_EDIT/):

library(dplyr)
library(tidyr)
library(purrr)

mtcars %>%
  select(-mpg) %>% #exclude outcome, leave predictors
  map(~ lm(mtcars$mpg ~ .x, data = mtcars)) %>%
  map_df(glance, .id='variable') %>%
  select(variable, r.squared, p.value)

# A tibble: 10 x 3
   variable r.squared  p.value
   <chr>        <dbl>    <dbl>
 1 cyl          0.726 6.11e-10
 2 disp         0.718 9.38e-10
 3 hp           0.602 1.79e- 7
 4 drat         0.464 1.78e- 5
 5 wt           0.753 1.29e-10
 6 qsec         0.175 1.71e- 2
 7 vs           0.441 3.42e- 5
 8 am           0.360 2.85e- 4
 9 gear         0.231 5.40e- 3
10 carb         0.304 1.08e- 3

And running a linear model over grouped variables is also easy using map:

mtcars %>%
  split(.$cyl) %>% #split by grouping variable
  map(~ lm(mpg ~ wt, data = .)) %>%
  map_df(broom::glance, .id='cyl') %>%
  select(cyl, variable, r.squared, p.value)

# A tibble: 3 x 3
  cyl   r.squared p.value
  <chr>     <dbl>   <dbl>
1 4         0.509  0.0137
2 6         0.465  0.0918
3 8         0.423  0.0118

So I can run by variable, or by group. However, I can't figure out how to combine these two (grouping everything by cyl, then running lm(mpg ~ each other variable, separately). I'd hoped to do something like this:

mtcars %>%
  select(-mpg) %>% #exclude outcome, leave predictors
  split(.$cyl) %>% # group by grouping variable
  map(~ lm(mtcars$mpg ~ .x, data = mtcars)) %>% #run lm across all variables
  map_df(glance, .id='cyl') %>%
  select(cyl, variable, r.squared, p.value)

and get a result that gives me cyl(group), variable, r.squared, and p.value (a combination of 3 groups * 10 variables = 30 model outputs).

But split() turns the dataframe into a list, which the construction from part 1 [ map(~ lm(mtcars$mpg ~ .x, data = mtcars)) ] can't handle. I have tried to modify it so that it doesn't explicitly refer to the original data structure, but can't figure out a working solution. Any help is greatly appreciated!

2
  • This seems to work (?) mtcars %>% select(-mpg) %>% group_by(cyl) %>% map(~ lm(mtcars$mpg ~ .x, data = mtcars)) . Why do you need to use split? Commented Dec 12, 2021 at 20:52
  • 1
    @NelsonGon Using group_by() does run without throwing any errors, but for me doesn't seem to give the regression results split by group - as far as I can tell, it gives identical results as in the first example, showing the stats for each variable. Commented Dec 12, 2021 at 21:20

1 Answer 1

2

IIUC, you can use group_by and group_modify, with a map inside that iterates over predictors.

If you can isolate your predictor variables in advance, it'll make it easier, as with ivs in this solution.

library(tidyverse)

ivs <- colnames(mtcars)[3:ncol(mtcars)]
names(ivs) <- ivs

mtcars %>% 
  group_by(cyl) %>% 
  group_modify(function(data, key) {
    map_df(ivs, function(iv) {
      frml <- as.formula(paste("mpg", "~", iv))
      lm(frml, data = data) %>% broom::glance()
      }, .id = "iv") 
  }) %>% 
  select(cyl, iv, r.squared, p.value)

# A tibble: 27 × 4
# Groups:   cyl [3]
     cyl iv    r.squared  p.value
   <dbl> <chr>     <dbl>    <dbl>
 1     4 disp  0.648      0.00278
 2     4 hp    0.274      0.0984 
 3     4 drat  0.180      0.193  
 4     4 wt    0.509      0.0137 
 5     4 qsec  0.0557     0.485  
 6     4 vs    0.00238    0.887  
 7     4 am    0.287      0.0892 
 8     4 gear  0.115      0.308  
 9     4 carb  0.0378     0.567  
10     6 disp  0.0106     0.826  
11     6 hp    0.0161     0.786  
# ...
Sign up to request clarification or add additional context in comments.

5 Comments

Interesting, I'm new to purrr and hadn't encountered group_modify, will explore more! Maybe I wasn't clear enough in my original question, sorry (will edit to add examples of the results I got/am looking for), but I'm trying to run individual regressions for each variable (e.g., disp, hp, drat) as well as for each group. This answer doesn't seen to differentiate between variables in the results. But I'll explore group_modify() more, see if I can modify this somehow.
Ah, I see. Ok, update coming - might need nested map
This does what I needed, thank you! This is outside the scope of the original question, so I'll mark it answered regardless, but: my real data is messy and contains some groups that are entirely NA. Normally I'd add a line like "if (all(is.na(iv)) {return(NA)} else" right before my lm function, but I have no idea how to do that nested inside map, do you have any ideas for that? Thanks regardless!
This is almost certainly not the most elegant solution, but for anyone who ever stumbles across this same issue, this is what I came up with to deal with the na issue: map_df(ivs, function(iv) { tmpvar<-eval(parse(text = paste0("data$",iv))) if(all(is.na(tmpvar))) {return(NA)} else{ frml <- as.formula(paste("mpg", "~", iv)) lm(frml, data = data) %>% broom::glance() }}, .id = "iv")
Correcting my attempt, the correct solution to this is in this follow-up question: stackoverflow.com/questions/70344641/…

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.