
I have many data frames that all share the same variables and structure. Starting from individual-level data in each input data frame, I would like to use functions to summarize across all rows and create new variables. That is, for every input data frame I want an output data frame with one row summarizing the variables named in regularVar_names for a specified age group, with the age group implemented flexibly. The function should count the number of rows where the variable is not NA. In the code below, I subset the variable X_AGE80 to be between 18 and 84. Ultimately I need this to work for different age groups that are subsets of a master dataset of adults, such as 18-20, 21-24, 25-84, 18 only, 19 only, etc. However, I was thinking along the lines of @margusl that this is easy to control from outside the function. It would be icing on the cake if the answer handled age groups in an elegant way.

This is how I tried to implement it.

Data:

input.ds.2018 = data.frame(Var1 = c(1,1,NA,NA,1,2),Var2 = rep(c(1,2),3),V3 = c(NA,rep(2,4),1),
                         y_4 = c(NA,"y","z","l","m","n"),X_AGE80 = c(17,18,NA,84,21,72))

This is my attempted solution, but apparently . does not supply the input data frame as I assumed.

calc_unwt_n_regularVar_fn = function(df,VAR){
  df %>% filter(!is.na(eval(parse(text = VAR)))) %>% nrow

}
# apply calc_unwt_n_regularVar_fn to age-group 18 to 84 for regular variables called Var1 and Var2
regularVar_names = c("Var1","Var2")
output = input.ds.2018 %>%
  filter(X_AGE80 <= 84) %>%
  filter(X_AGE80 >= 18)  %>%
  summarize(across(all_of(regularVar_names), ~ calc_unwt_n_regularVar_fn(.,cur_column()),.names = "unwt_denom_{.col}"))

However it thinks . is equivalent to cur_column(), so it throws an error:

Error in `summarize()`:
i In argument: `across(...)`.
Caused by error in `across()`:
! Can't compute column `unwt_denom_Var1`.
Caused by error in `UseMethod()`:
! no applicable method for 'filter' applied to an object of class "c('double', 'numeric')" 

I also tried replacing . with .data to pass in the input data frame as a parameter, but that didn't work either.

So my questions are: (1) How do I input the dataframe as a parameter to the function, "calc_unwt_n_regularVar_fn"? Or if this is a dumb way to go about it, (2) How should I implement creating new summary variables for each input dataframe and various age groups, where the summary variables are required for each input dataframe/age group combination.

  • What should calc_unwt_n_regularVar_fn calculate? I do not see where age group gets specified automatically. Is it always 18 to 84? Commented Jan 15 at 8:56
  • @Friede Age group gets specified by the two filter statements feeding into summarize(across() ~).: ``` output = input.ds.2018 %>% filter(X_AGE80 <= 84) %>% filter(X_AGE80 >= 18) %>% summarize(across(all_of(regularVar_names), ~ calc_unwt_n_regularVar_fn(.,cur_column(),"denom"),.names = "unwt_denom_{.col}")) ``` calc_unwt_n_regularVar_fn should calculate the number of rows of the input dataframe where VAR is not NA (specified by the "denom" parameter). Commented Jan 15 at 9:02
  • As I asked, is it always 18 to 84? What are you trying to do with calc_unwt_n_regularVar_fn? Please explain. Your question is currently quite unclear. You are also missing a second data frame to allow demonstration for i>1, e.g. on a list of data frames (you said you have many). Are they collected in a list? Commented Jan 15 at 9:03
  • @Friede No, it is not always 18 to 84. I need to be able to flexibly change the age range and specify age_lower (in this case 18) and age_upper (in this case 84). Currently, I just want it to work for one dataframe. Commented Jan 15 at 9:06
  • @Friede The function calculates the number of rows in the input dataframe where VAR is not NA Commented Jan 15 at 9:16

2 Answers


(A) In base, you can start from

  1. Set-up data
regularVar_names = c("Var1","Var2")

input.ds.2018 = data.frame(Var1 = c(1,1,NA,NA,1,2),
                           Var2 = rep(c(1,2),3),
                           V3 = c(NA,rep(2,4),1),
                           y_4 = c(NA,"y","z","l","m","n"), 
                           X_AGE80 = c(17,18,NA,84,21,72))

input.ds.2017 = data.frame(Var1 = c(1,NA,NA,NA,1,2),
                           Var2 = rep(c(1,2),3),
                           V3 = c(NA,rep(2,4),1),
                           y_4 = c(NA,"y","z","l","m","n"), 
                           X_AGE80 = c(17,18,NA,84,21,72))
  2. Collect data frames in list (if not already done)
l = mget(ls(pattern = "^input.ds.")) # could be more specific
  3. Define function
f = \(X, y, lower, upper) {
  stopifnot(c(y, "X_AGE80") %in% names(X))
  X = subset(X, X_AGE80 %in% lower:upper)
  vapply(X[y], \(i) length(i[!is.na(i)]), numeric(1L))
}
  4. Apply function f on list of data frames l
# FUN.VALUE must match what f returns: one count per variable in y
vapply(l, f, y = regularVar_names, lower = 18, upper = 84,
       FUN.VALUE = numeric(length(regularVar_names)))

     input.ds.2017 input.ds.2018
Var1             2             3
Var2             4             4

You might want to add |> as.data.frame().

  5. Optionally, we could rename Var1, Var2 to something else
vapply(l, f, y = regularVar_names, lower = 18, upper = 84,
       FUN.VALUE = numeric(length(regularVar_names))) |> 
  `row.names<-`(paste0("unwt_denom_", regularVar_names))

What does the prefix unwt_denom_ mean? It might be better to track those names in a separate column.
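
One way to keep those names in a separate column is to reshape the matrix into long format. The following is only a minimal base R sketch: it repeats a trimmed copy of the example data so it runs on its own, and the names counts, long, source, variable and unwt_denom are placeholders chosen here, not part of the question.

```r
# Trimmed example data and helper, repeated so the block is self-contained
input.ds.2018 = data.frame(Var1 = c(1, 1, NA, NA, 1, 2), Var2 = rep(c(1, 2), 3),
                           X_AGE80 = c(17, 18, NA, 84, 21, 72))
input.ds.2017 = data.frame(Var1 = c(1, NA, NA, NA, 1, 2), Var2 = rep(c(1, 2), 3),
                           X_AGE80 = c(17, 18, NA, 84, 21, 72))
l = list(input.ds.2017 = input.ds.2017, input.ds.2018 = input.ds.2018)
regularVar_names = c("Var1", "Var2")

f = \(X, y, lower, upper) {
  stopifnot(c(y, "X_AGE80") %in% names(X))
  X = subset(X, X_AGE80 %in% lower:upper)
  vapply(X[y], \(i) sum(!is.na(i)), numeric(1L))
}

# Matrix with one row per variable and one column per data frame
counts = vapply(l, f, y = regularVar_names, lower = 18, upper = 84,
                FUN.VALUE = numeric(length(regularVar_names)))

# Long format: one row per (data frame, variable) pair.
# as.vector() walks the matrix column by column, which matches the rep() calls.
long = data.frame(
  source     = rep(colnames(counts), each = nrow(counts)),
  variable   = rep(rownames(counts), times = ncol(counts)),
  unwt_denom = as.vector(counts)
)
long
```

With the label in its own column, the meaning of each count no longer depends on a naming convention in the row names.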

(B) With purrr + dplyr

library(dplyr)
library(purrr)
map_df(l, ~ { . |>
  filter(X_AGE80 %in% 18:84) |>
  summarise(across(all_of(regularVar_names), ~length(.[!is.na(.)]))) },
  .id = "DF_name")

or

f2 = \(X, y, lower, upper) {
  stopifnot(c(y, "X_AGE80") %in% names(X))
  # use the arguments instead of hard-coded bounds and variable names
  X |> filter(X_AGE80 %in% lower:upper) |>
    summarise(across(all_of(y), ~length(.[!is.na(.)])))
}

map_df(l, f2, y = regularVar_names, lower = 18, upper = 84, .id = "DF_name")
        DF_name Var1 Var2
1 input.ds.2017    2    4
2 input.ds.2018    3    4


I'd just classify age groups (e.g. through a join or case_when()) and apply an anonymous function in summarise(across(...)) by those groups, something like:

library(dplyr, warn.conflicts = FALSE)
age_groups <- tribble(
  ~grp, ~start, ~end,
  "<18",   0,  17,
  "18-84", 18, 84,
  ">84",   85, Inf 
)

inner_join(input.ds.2018, age_groups, by = join_by(between(X_AGE80, start, end))) %>% 
  # to present inner_join result
  print() %>%
  summarise(across(all_of(regularVar_names), \(x) sum(!is.na(x)), .names = "unwt_denom_{.col}"), .by = grp)

#> # A tibble: 5 × 8
#>    Var1  Var2    V3 y_4   X_AGE80 grp   start   end
#>   <dbl> <dbl> <dbl> <chr>   <dbl> <chr> <dbl> <dbl>
#> 1     1     1    NA <NA>       17 <18       0    17
#> 2     1     2     2 y          18 18-84    18    84
#> 3    NA     2     2 l          84 18-84    18    84
#> 4     1     1     2 m          21 18-84    18    84
#> 5     2     2     1 n          72 18-84    18    84

#> # A tibble: 2 × 3
#>   grp   unwt_denom_Var1 unwt_denom_Var2
#>   <chr>           <int>           <int>
#> 1 <18                 1               1
#> 2 18-84               3               4
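
Because this is a non-equi join, the age groups do not have to be mutually exclusive: a row whose age falls into several ranges is matched to each of them. The sketch below illustrates that under a few assumptions: dplyr 1.1.0 or later (for join_by() and .by), a hypothetical second data frame input.ds.2017, and group labels chosen here to mirror the subsets named in the question.

```r
library(dplyr, warn.conflicts = FALSE)

input.ds.2018 <- tibble(Var1 = c(1, 1, NA, NA, 1, 2), Var2 = rep(c(1, 2), 3),
                        X_AGE80 = c(17, 18, NA, 84, 21, 72))
# Hypothetical second year, included only to show the stacking step
input.ds.2017 <- tibble(Var1 = c(1, NA, NA, NA, 1, 2), Var2 = rep(c(1, 2), 3),
                        X_AGE80 = c(17, 18, NA, 84, 21, 72))
regularVar_names <- c("Var1", "Var2")

# Overlapping ranges are fine: age 18 matches "18", "18-20" and "18-84"
age_groups <- tribble(
  ~grp,    ~start, ~end,
  "18",    18,     18,
  "18-20", 18,     20,
  "21-24", 21,     24,
  "25-84", 25,     84,
  "18-84", 18,     84
)

# Stack the inputs, attach every matching age group, then count non-NA values
out <- bind_rows(list(`2017` = input.ds.2017, `2018` = input.ds.2018), .id = "year") %>%
  inner_join(age_groups, by = join_by(between(X_AGE80, start, end))) %>%
  summarise(across(all_of(regularVar_names), \(x) sum(!is.na(x)),
                   .names = "unwt_denom_{.col}"),
            .by = c(year, grp))
out
```

One caveat of this design: an age group with no matching rows at all simply does not appear in the result, so a zero count is only reported when rows exist but the variable is NA.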

But to answer your questions: you can pass data to a function with pick(). In this context, . is not evaluated as a placeholder for the magrittr pipe; it is part of the formula notation of across()'s function argument. I personally find the anonymous function shorthand from base ( \(x) dosomething(x) ) less confusing.

# to pass data for ...
calc_unwt_n_regularVar_fn_df = function(data, VAR, age_min = 0, age_max = 999){
  # VAR is a promise, embrace it with {{}}, https://dplyr.tidyverse.org/articles/programming.html#indirection
  data %>%
    filter(between(X_AGE80, age_min, age_max), !is.na({{VAR}})) %>%
    nrow()
}

# ... use pick(), it will also correctly handle grouped data;
# pick(everything()) passes the current data, x is the column being summarised
input.ds.2018 %>%
  summarize(across(all_of(regularVar_names), 
                   \(x) calc_unwt_n_regularVar_fn_df(pick(everything()), VAR = x, age_min = 18, age_max = 84), 
                   .names = "unwt_denom_{.col}"))
#> # A tibble: 1 × 2
#>   unwt_denom_Var1 unwt_denom_Var2
#>             <int>           <int>
#> 1               3               4
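
Since the function inside across() receives each column as a plain vector, a helper that works on a vector may be all that is needed, with the age restriction kept in filter() outside the function. This is a sketch rather than a drop-in replacement, and the helper name count_non_na is made up here.

```r
library(dplyr, warn.conflicts = FALSE)

input.ds.2018 <- tibble(
  Var1 = c(1, 1, NA, NA, 1, 2),
  Var2 = rep(c(1, 2), 3),
  X_AGE80 = c(17, 18, NA, 84, 21, 72)
)
regularVar_names <- c("Var1", "Var2")

# Hypothetical helper: takes a vector, which is what across() hands over
count_non_na <- function(x) sum(!is.na(x))

# Inside across(), `.x` is the column itself, not the data frame
col_classes <- input.ds.2018 %>%
  summarise(across(all_of(regularVar_names), ~ class(.x)))

# Age range controlled from outside; rows with NA age are dropped by filter()
out <- input.ds.2018 %>%
  filter(between(X_AGE80, 18, 84)) %>%
  summarise(across(all_of(regularVar_names), count_non_na,
                   .names = "unwt_denom_{.col}"))
out
```

Here col_classes shows that each selected column arrives as a numeric vector, which is why calling filter() on it fails in the original attempt.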

Example data:


input.ds.2018 = tibble(
  Var1 = c(1, 1, NA, NA, 1, 2),
  Var2 = rep(c(1, 2), 3),
  V3 = c(NA, rep(2, 4), 1),
  y_4 = c(NA, "y", "z", "l", "m", "n"),
  X_AGE80 = c(17, 18, NA, 84, 21, 72)
)

regularVar_names = c("Var1","Var2")
