I'm using purrr to run a series of single linear regressions across multiple columns of a grouped dataset, but am having trouble excluding groups of variables that have no data without deleting the entire group.
Thanks to andrew_reece here, I got the base code working as:
library(tidyverse)
ivs <- colnames(mtcars)[3:ncol(mtcars)]
names(ivs) <- ivs
mtcars %>%
  group_by(cyl) %>%
  group_modify(function(data, key) {
    map_df(ivs, function(iv) {
      frml <- as.formula(paste("mpg", "~", iv))
      lm(frml, data = data) %>% broom::glance()
    }, .id = "iv")
  }) %>%
  select(cyl, iv, r.squared, p.value)
which gives a tibble in this format:
cyl iv r.squared p.value
4 disp 0.6484051396 0.002782827
4 hp 0.2740558319 0.098398581
4 drat 0.180 0.193
4 wt 0.509 0.0137
4 qsec 0.0557 0.485
4 vs 0.00238 0.887
...
6 disp 0.0106260401 0.825929685
...
Unfortunately, my real dataset is messy and contains multiple group-variable combinations with only NAs, or with fewer than two non-missing values, which lm can't handle. To show this, here is mtcars with some of the data in 'disp' replaced with NA. When run through the code above, mtna throws an NA-related error.
#create mtcars dataset that will have a cyl group with entirely NA disp
mtna <- mtcars
mtna$disp[mtna$disp < 147] <- NA
test <- mtna %>% group_by(cyl) %>% summarize(mean = mean(disp))
I tried to deal with this by making the lm conditional, and first using sum(!is.na) to check if there are enough real values to run lm. This allows the lm to run successfully.
mtna %>%
  group_by(cyl) %>%
  group_modify(function(data, key) {
    map_df(ivs, function(iv) {
      tmpvar <- eval(parse(text = paste0("data$", iv)))
      if (sum(!is.na(tmpvar)) < 3) {
        return(NA)
      } else {
        frml <- as.formula(paste("mpg", "~", iv))
        lm(frml, data = data) %>% broom::glance()
      }
    }, .id = "iv")
  }) %>%
  select(cyl, iv, r.squared, p.value)
#which gives:
cyl iv r.squared p.value
1 4 NA NA NA
2 6 disp 0.0115 0.840
3 6 hp 0.0161 0.786
4 6 drat 0.0132 0.807
...
However, when you look at the results, you can see that the NA has extended to the whole group, including variables other than disp (which is the only one that had missing values). There is now no data for cyl = 4 at all, even for variables like hp and drat, which had no missing data.
What I was hoping for was something like:
cyl iv r.squared p.value
4 disp NA NA
4 hp 0.2740558319 0.098398581 # Currently missing
4 drat 0.1799791311 0.193450651 # Currently missing
4 wt 0.5086325963 0.013742782 # Currently missing
...
6 disp 0.0106260401 0.825929685
6 hp 0.0161462379 0.78602021
...
I suspect this has something to do with the data format - I guess I'm mapping NA across all the results for that group, instead of just that one variable. But I have no idea how to address this. Any help is greatly appreciated!
Regarding group_modify(), I think your own guess is on point: according to its documentation, your lambda function inside the map_df() should return a data frame. While the return(NA) may get coerced to a data frame, it might very well be that purrr's magic is thrown off by this use case. Personally, I would go for a more completely purrr-styled solution by adding list variables to your original data frame.
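As a minimal sketch of that idea (not necessarily the only or best fix), the inner function can return a one-row tibble of NAs instead of a bare NA, so that every variable contributes a row of the same shape. The helper name fit_one and its min_obs argument are hypothetical, introduced here only for illustration; the threshold of 3 mirrors the check in the question.

```r
library(dplyr)
library(purrr)
library(tibble)

# Same test data as in the question: disp is entirely NA for cyl == 4
mtna <- mtcars
mtna$disp[mtna$disp < 147] <- NA

ivs <- colnames(mtcars)[3:ncol(mtcars)]
names(ivs) <- ivs

# Hypothetical helper: always returns a one-row tibble, so the rows
# for the remaining variables in a group are never lost.
fit_one <- function(data, iv, min_obs = 3) {
  # Count rows where both the predictor and the response are present
  ok <- !is.na(data[[iv]]) & !is.na(data$mpg)
  if (sum(ok) < min_obs) {
    # Not enough data: return an NA placeholder row with the same columns
    return(tibble(r.squared = NA_real_, p.value = NA_real_))
  }
  fit <- lm(reformulate(iv, response = "mpg"), data = data)
  broom::glance(fit) %>% select(r.squared, p.value)
}

result <- mtna %>%
  group_by(cyl) %>%
  group_modify(function(data, key) {
    # .id = "iv" labels each row with the name of the variable
    map_dfr(ivs, function(iv) fit_one(data, iv), .id = "iv")
  }) %>%
  ungroup()

print(result, n = Inf)
```

With this shape, the row for cyl = 4 and disp should hold NA values while the other variables in that group keep their estimates. The same helper could presumably also be mapped over a nested list column (for example one built with tidyr::nest()), which is closer to the list-variable approach mentioned above.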