Using an if-statement inside purrr::map to avoid missing data errors

Question

I'm using purrr to run a series of single linear regressions across multiple columns of a grouped dataset, but am having trouble excluding groups of variables that have no data without deleting the entire group.

Thanks to andrew_reece here, I got the base code working as:

library(tidyverse)

ivs <- colnames(mtcars)[3:ncol(mtcars)]
names(ivs) <- ivs

mtcars %>% 
  group_by(cyl) %>% 
  group_modify(function(data, key) {
    map_df(ivs, function(iv) {
      frml <- as.formula(paste("mpg", "~", iv))
      lm(frml, data = data) %>% broom::glance()
      }, .id = "iv") 
  }) %>% 
  select(cyl, iv, r.squared, p.value)

which gives a tibble in this format:

cyl iv      r.squared       p.value
4   disp    0.6484051396    0.002782827 
4   hp      0.2740558319    0.098398581 
4   drat    0.180           0.193  
4   wt      0.509           0.0137 
4   qsec    0.0557          0.485  
4   vs      0.00238         0.887  
...
6   disp    0.0106260401    0.825929685 
...

Unfortunately, my real dataset is messy and contains multiple group-variable combinations with only NAs, or with less than two real values, which lm can't handle. To show this, here is mtcars with some data in 'disp' replaced with NA. Run through the above code, mtna throws a NA-error.

#create mtcars dataset that will have a cyl group with entirely NA disp
mtna <- mtcars 
mtna$disp[mtna$disp < 147] <- NA

test <- mtna %>% group_by(cyl) %>% summarize(mean = mean(disp))

I tried to deal with this by making the lm conditional, and first using sum(!is.na) to check if there are enough real values to run lm. This allows the lm to run successfully.

mtna %>% 
  group_by(cyl) %>% 
  group_modify(function(data, key) {
    map_df(ivs, function(iv) {
      tmpvar <- eval(parse(text = paste0("data$", iv)))
      if(sum(!is.na(tmpvar)) < 3) {return(NA)} else {
      frml <- as.formula(paste("mpg", "~", iv))
      lm(frml, data = data) %>% broom::glance()
      }}, .id = "iv") 
  }) %>% 
  select(cyl, iv, r.squared, p.value)

#which gives:
       cyl    iv        r.squared  p.value
 1     4      NA        NA         NA     
 2     6      disp      0.0115     0.840 
 3     6      hp        0.0161     0.786 
 4     6      drat      0.0132     0.807 
...

However, when you look at the results, you can see that the NA has extended to the whole group, including variables other than disp (which is the only one that had missing values). There is now no data related to cyl = 4 at all, even in groups like hp and drat, which had no missing data.

What I was hoping for was something like:

cyl    iv        r.squared      p.value
4      disp      NA             NA 
4      hp        0.2740558319   0.098398581   # Currently missing
4      drat      0.1799791311   0.193450651   # Currently missing     
4      wt        0.5086325963   0.013742782.  # Currently missing
... 
6      disp      0.0106260401   0.825929685 
6      hp        0.0161462379   0.78602021  
...

I suspect this has something to do with the data format - I guess I'm mapping NA across all the results for that group, instead of just that one variable. But I have no idea how to address this. Any help is greatly appreciated!

Not having had any experience with group_modify(), I think your own guess is to the point: following its documentation, your lambda function inside the map_df() should return a dataframe. While the return(NA) might/is/will be cast to a data frame, it might very well be that purrr's magic is thrown off by this use case. Personally, I would go for a more completely purrr styled solution by adding list variables in your original data frame. — Paul Lemmens
– Paul Lemmens, Commented Dec 14, 2021 at 7:23
@PaulLemmens Can you explain what you mean by "adding list variables" in the original df? There's a working solution for this question in the answers, but I'm new to purrr and would like to improve the way I'm using it! — Jaken
– Jaken, Commented Dec 14, 2021 at 15:09
Sorry, I missed your reply. I meant following the examples all the way at the bottom of purrr.tidyverse.org/reference/map.html. — Paul Lemmens
– Paul Lemmens, Commented Dec 26, 2021 at 12:09

storm surge · Accepted Answer · 2021-12-14 08:54:15Z

I think you are almost there. Take a look at the code below. Changes are commented.

library(tidyverse)

ivs <- colnames(mtcars)[3:ncol(mtcars)]
names(ivs) <- ivs

mtna <- mtcars 
mtna$disp[mtna$disp < 147] <- NA

mtna  %>% 
  group_by(cyl) %>% 
  group_modify(function(data, key) {
    map_df(ivs, function(iv) {
      tmpvar<-eval(parse(text = paste0("data$", iv)))
      if(!is.na(sum(tmpvar))) {                          #only use complete data
        frml <- as.formula(paste("mpg", "~", iv))
        lm(frml, data = data) %>% broom::glance()
      }}, .id = "iv") 
  }) %>% 
  select(cyl, iv, r.squared, p.value) %>% 
  right_join(.,expand.grid(cyl=unique(mtna$cyl),iv=ivs),
             by=c("cyl","iv")) %>%                       #populating with NA for columns lost before
  arrange(., cyl, iv) %>%                                #sort by cyl and iv
  as.data.frame()
  

  #which gives:
       cyl   iv    r.squared    p.value
    1    4   am 0.2872892493 0.08921640
    2    4 carb 0.0378466325 0.56650426
    3    4 disp           NA         NA
    4    4 drat 0.1799791311 0.19345065
    5    4 gear 0.1146552225 0.30840242
    6    4   hp 0.2740558319 0.09839858
    7    4 qsec 0.0556742395 0.48487154
    8    4   vs 0.0023819528 0.88668635
    9    4   wt 0.5086325963 0.01374278
    10   6   am 0.2810551424 0.22094408
    11   6 carb 0.0000661434 0.98619369
    12   6 disp           NA         NA
    13   6 drat 0.0131597553 0.80653099
    14   6 gear 0.0000901510 0.98388187
    15   6   hp 0.0161462379 0.78602021
    16   6 qsec 0.1753245893 0.34980138
    17   6   vs 0.2810551424 0.22094408
    18   6   wt 0.4645101506 0.09175766
    19   8   am 0.0024647887 0.86615546
    20   8 carb 0.1550637156 0.16359436
    21   8 disp 0.2701577717 0.05677488
    22   8 drat 0.0022975223 0.87074078
    23   8 gear 0.0024647887 0.86615546
    24   8   hp 0.0804491933 0.32575378
    25   8 qsec 0.0108860059 0.72261712
    26   8   vs 0.0000000000         NA
    27   8   wt 0.4229655365 0.01179281

This works perfectly, thank you! It seems the key is simply changing the if statement to run only when NAs are not present, without returning anything in the opposite case. The right_join fleshes out the data frame so missing variables are present (with NAs), which is a bonus, but the code returns results for all not-NA variables just by changing the if statement.

Collectives™ on Stack Overflow

Using an if-statement inside purrr::map to avoid missing data errors

1 Answer 1

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related