Running single linear regressions across multiple variables, in groups

Question

I'm trying to run a simple single linear regression over a large number of variables, grouped according to another variable. Using the mtcars dataset as an example, I'd like to run a separate linear regression between mpg and each other variable (mpg ~ disp, mpg ~ hp, etc.), grouped by another variable (for example, cyl).

Running lm over each variable independently can easily be done using purrr::map (modified from this great tutorial - https://sebastiansauer.github.io/EDIT-multiple_lm_purrr_EDIT/):

library(dplyr)
library(tidyr)
library(purrr)
library(broom) # provides glance()

mtcars %>%
  select(-mpg) %>% #exclude outcome, leave predictors
  map(~ lm(mtcars$mpg ~ .x, data = mtcars)) %>%
  map_df(glance, .id='variable') %>%
  select(variable, r.squared, p.value)

# A tibble: 10 x 3
   variable r.squared  p.value
   <chr>        <dbl>    <dbl>
 1 cyl          0.726 6.11e-10
 2 disp         0.718 9.38e-10
 3 hp           0.602 1.79e- 7
 4 drat         0.464 1.78e- 5
 5 wt           0.753 1.29e-10
 6 qsec         0.175 1.71e- 2
 7 vs           0.441 3.42e- 5
 8 am           0.360 2.85e- 4
 9 gear         0.231 5.40e- 3
10 carb         0.304 1.08e- 3

And running a linear model over grouped variables is also easy using map:

mtcars %>%
  split(.$cyl) %>% #split by grouping variable
  map(~ lm(mpg ~ wt, data = .)) %>%
  map_df(broom::glance, .id='cyl') %>%
  select(cyl, r.squared, p.value)

# A tibble: 3 x 3
  cyl   r.squared p.value
  <chr>     <dbl>   <dbl>
1 4         0.509  0.0137
2 6         0.465  0.0918
3 8         0.423  0.0118

So I can run by variable, or by group. However, I can't figure out how to combine the two: grouping everything by cyl, then running lm(mpg ~ x) separately for each other variable x. I'd hoped to do something like this:

mtcars %>%
  select(-mpg) %>% #exclude outcome, leave predictors
  split(.$cyl) %>% # group by grouping variable
  map(~ lm(mtcars$mpg ~ .x, data = mtcars)) %>% #run lm across all variables
  map_df(glance, .id='cyl') %>%
  select(cyl, variable, r.squared, p.value)

and get a result that gives me cyl(group), variable, r.squared, and p.value (a combination of 3 groups * 10 variables = 30 model outputs).

But split() turns the dataframe into a list, which the construction from part 1 [ map(~ lm(mtcars$mpg ~ .x, data = mtcars)) ] can't handle. I have tried to modify it so that it doesn't explicitly refer to the original data structure, but can't figure out a working solution. Any help is greatly appreciated!
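To make that failure concrete, here is a minimal sketch of what split() hands to map() in the attempt above. It assumes only base R and the built-in mtcars data; the name pieces is illustrative and not part of the original code.

```r
# split() returns a named list with one data frame per value of cyl
pieces <- split(mtcars[, names(mtcars) != "mpg"], mtcars$cyl)

class(pieces)                     # a plain list, no longer a data frame
names(pieces)                     # one element per cyl value
sapply(pieces, nrow)              # number of rows in each group
sapply(pieces, is.data.frame)     # every element is a whole data frame

# So inside map(~ lm(mtcars$mpg ~ .x, ...)), `.x` is an entire per-group
# data frame rather than a single predictor column, which lm() cannot use
# as a term in the formula.
```

This suggests that two levels of iteration are needed: one over the groups and one over the predictor columns within each group.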

This seems to work (?): mtcars %>% select(-mpg) %>% group_by(cyl) %>% map(~ lm(mtcars$mpg ~ .x, data = mtcars)). Why do you need to use split?
– NelsonGon, Commented Dec 12, 2021 at 20:52
@NelsonGon Using group_by() does run without throwing any errors, but for me it doesn't seem to give the regression results split by group. As far as I can tell, it gives the same results as the first example, showing the stats for each variable.
– Jaken, Commented Dec 12, 2021 at 21:20

andrew_reece · Accepted Answer · 2021-12-12 22:18:35Z

2

If I understand correctly, you can use group_by and group_modify, with a map inside that iterates over the predictors.

If you can isolate your predictor variables in advance, it'll make it easier, as with ivs in this solution.

library(tidyverse)

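# Predictors: every column after the outcome (mpg, column 1) and the grouping variable (cyl, column 2)
ivs <- colnames(mtcars)[3:ncol(mtcars)]
names(ivs) <- ivs # named vector, so map_df() can fill the .id column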

mtcars %>% 
  group_by(cyl) %>% 
  group_modify(function(data, key) {
    map_df(ivs, function(iv) {
      frml <- as.formula(paste("mpg", "~", iv))
      lm(frml, data = data) %>% broom::glance()
      }, .id = "iv") 
  }) %>% 
  select(cyl, iv, r.squared, p.value)

# A tibble: 27 × 4
# Groups:   cyl [3]
     cyl iv    r.squared  p.value
   <dbl> <chr>     <dbl>    <dbl>
 1     4 disp  0.648      0.00278
 2     4 hp    0.274      0.0984 
 3     4 drat  0.180      0.193  
 4     4 wt    0.509      0.0137 
 5     4 qsec  0.0557     0.485  
 6     4 vs    0.00238    0.887  
 7     4 am    0.287      0.0892 
 8     4 gear  0.115      0.308  
 9     4 carb  0.0378     0.567  
10     6 disp  0.0106     0.826  
11     6 hp    0.0161     0.786  
# ...
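As a variation, the same result can probably be reached with the split() approach from the question by nesting one map inside another. This is only a sketch under a few assumptions: the names predictors and results are illustrative, the predictors are selected by name with setdiff() rather than by column position, and the formula is built with stats::reformulate() instead of paste().

```r
library(dplyr)
library(purrr)
library(broom)

# Every column except the outcome (mpg) and the grouping variable (cyl)
predictors <- setdiff(names(mtcars), c("mpg", "cyl"))
names(predictors) <- predictors  # names feed the .id column below

results <- mtcars %>%
  split(.$cyl) %>%                       # outer level: one data frame per cyl
  map_dfr(function(group_data) {
    map_dfr(predictors, function(iv) {   # inner level: one model per predictor
      fit <- lm(reformulate(iv, response = "mpg"), data = group_data)
      glance(fit) %>% select(r.squared, p.value)
    }, .id = "iv")
  }, .id = "cyl")

results
```

Here cyl comes back as a character column, because it is taken from the names of the list that split() produces, whereas group_modify keeps it numeric.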

edited Dec 12, 2021 at 22:18

answered Dec 12, 2021 at 21:31


5 Comments

Jaken Over a year ago

Interesting, I'm new to purrr and hadn't encountered group_modify, will explore more! Maybe I wasn't clear enough in my original question, sorry (will edit to add examples of the results I got/am looking for), but I'm trying to run individual regressions for each variable (e.g., disp, hp, drat) as well as for each group. This answer doesn't seem to differentiate between variables in the results. But I'll explore group_modify() more and see if I can modify this somehow.

andrew_reece Over a year ago

Ah, I see. Ok, update coming - might need nested map

Jaken Over a year ago

This does what I needed, thank you! This is outside the scope of the original question, so I'll mark it answered regardless, but: my real data is messy and contains some groups that are entirely NA. Normally I'd add a line like "if (all(is.na(iv))) {return(NA)} else" right before my lm function, but I have no idea how to do that nested inside map. Do you have any ideas for that? Thanks regardless!

Jaken Over a year ago

This is almost certainly not the most elegant solution, but for anyone who ever stumbles across this same issue, this is what I came up with to deal with the NA issue:

map_df(ivs, function(iv) {
  tmpvar <- eval(parse(text = paste0("data$", iv)))
  if (all(is.na(tmpvar))) {
    return(NA)
  } else {
    frml <- as.formula(paste("mpg", "~", iv))
    lm(frml, data = data) %>% broom::glance()
  }
}, .id = "iv")

Jaken Over a year ago

Correcting my attempt, the correct solution to this is in this follow-up question: stackoverflow.com/questions/70344641/…
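For readers who only want the general idea of the all-NA check discussed in these comments, here is a minimal sketch. It is an assumption about one workable approach, not necessarily what the linked follow-up recommends. The helper glance_or_na and the demo data cars_na are hypothetical names introduced for illustration, and the helper returns a one-row tibble of NA values rather than a bare NA so that the rows can still be bound together.

```r
library(dplyr)
library(purrr)
library(broom)

# Hypothetical helper: fit mpg ~ iv on one group's data, but skip lm()
# and return NA statistics when the predictor is entirely NA in that group.
glance_or_na <- function(iv, data) {
  if (all(is.na(data[[iv]]))) {
    return(tibble(r.squared = NA_real_, p.value = NA_real_))
  }
  fit <- lm(reformulate(iv, response = "mpg"), data = data)
  glance(fit) %>% select(r.squared, p.value)
}

# Demo data (made up for this sketch): blank out hp for the 4-cylinder cars
cars_na <- mtcars
cars_na$hp[cars_na$cyl == 4] <- NA

ivs <- setdiff(names(cars_na), c("mpg", "cyl"))
names(ivs) <- ivs

res_na <- cars_na %>%
  group_by(cyl) %>%
  group_modify(function(data, key) {
    map_df(ivs, function(iv) glance_or_na(iv, data), .id = "iv")
  }) %>%
  ungroup()

res_na %>% filter(cyl == 4)
```

Indexing with data[[iv]] avoids the eval(parse(...)) construction, and the affected group and predictor still appear in the output with NA statistics instead of being dropped.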
