How to use a named variable within a function

Question

Assume the following dummy data frame:

library(data.table)

dt <- data.table(A=c("a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "d"), 
             B=c("e", "e", "e", "e", "e", "e", "f", "f", "f", "f", "f", "f"), 
             C=1:12, 
             D=13:24)

I'd like to calculate some statistics (say, mean and standard deviation) for each numeric column ("C" and "D"), each time grouped by the factor columns c("A"), c("B"), and c("A", "B"). In the actual data frame, I have about 40 numeric columns, 10 factor columns that group in different combinations, and a long list of statistics I'd like to calculate. Based on the answer (by @thelatemail) I got to a previous question, I know I can use the code below to handle the factor groupings (by=) using a list:

groupList <- list(c("A", "B"), c("A"), c("B"))
out <- vector("list", 3)
out <- lapply(
  groupList,
  function(x) {
    dt[, .(mean=mean(C), sd=sd(C)), by=x]
  }
)

Now I'd like to go a step further: create a variable containing a list of the names of the numeric columns in the data frame, and use that variable inside the function above. I came up with the following code but, unfortunately, it doesn't work. My idea is to use a loop that takes one value from measureList on each iteration and passes it to the mean and sd functions. Any ideas? The loop is how I tend to think about these things, but I'll be glad to get rid of it if that makes the code faster or more efficient (particularly because one of my factor columns has 90 levels). I'd appreciate any pointers on solving this problem. Thanks!

factorList <- list(c("A"), c("B"), c("A", "B"))
measureList <- list(c("C"), c("D"))

out <- vector("list", 2)
for(i in 1:length(measureList)){
  out[[i]] <-lapply(
    factorList,
    function(x) {
      dt[, .(mean=mean(eval(measureList[[i]])), 
             sd=sd(eval(measureList[[i]]))),
         by = x]
    }
  )
}

Have you checked out dplyr? I'm pretty sure this can be done via a simple group_by() %>% summarise() combo
– A Duv, Commented Jun 28, 2018 at 22:43
Re your first lapply, I think the recently added groupingsets function should help. For example: stackoverflow.com/q/48547311 Regarding the eval thing in the second code chunk, that will make it quite a bit less efficient. See ?GForce and try running with DT[, ..., by=..., verbose=TRUE] to see if optimization is used.
– Frank, Commented Jun 29, 2018 at 14:15
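
Following up on Frank's point about eval: one possible way to put a column name held in a variable into j is to build the call with substitute(), so that data.table sees a plain mean(C) or mean(D). This is only a sketch; the helper name summarise_one and the placeholder symbol v are made up for illustration, and whether GForce actually applies is best checked with verbose=TRUE.

```r
library(data.table)

dt <- data.table(A = rep(c("a", "b", "c", "d"), each = 3),
                 B = rep(c("e", "f"), each = 6),
                 C = 1:12,
                 D = 13:24)

# Hypothetical helper: 'measure' is a column name given as a string.
summarise_one <- function(by_cols, measure) {
  # Replace the placeholder symbol v with the real column symbol,
  # e.g. mean(v) becomes mean(D); by_cols and dt are left untouched.
  qcall <- substitute(
    dt[, .(mean = mean(v), sd = sd(v)), by = by_cols],
    list(v = as.name(measure))
  )
  # Evaluate the finished call in this function's frame, where by_cols lives.
  eval(qcall)
}

res <- summarise_one(c("A", "B"), "D")
print(res)
```

Calling it inside lapply(factorList, ...) for each element of measureList would reproduce the nested structure of the loop in the question.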

Jaap · Accepted Answer · 2018-06-29 15:04:47Z

Another possibility is to use the new groupingsets function from data.table:

groupingsets(dt
             , j = lapply(.SD, function(x) list(mean(x), sd(x)))
             , by = c('A','B')
             , sets = factorList)[, type := c('mean','sd')][]

which gives:

      A    B        C        D type
 1:    a <NA>        2       14 mean
 2:    a <NA>        1        1   sd
 3:    b <NA>        5       17 mean
 4:    b <NA>        1        1   sd
 5:    c <NA>        8       20 mean
 6:    c <NA>        1        1   sd
 7:    d <NA>       11       23 mean
 8:    d <NA>        1        1   sd
 9: <NA>    e      3.5     15.5 mean
10: <NA>    e 1.870829 1.870829   sd
11: <NA>    f      9.5     21.5 mean
12: <NA>    f 1.870829 1.870829   sd
13:    a    e        2       14 mean
14:    a    e        1        1   sd
15:    b    e        5       17 mean
16:    b    e        1        1   sd
17:    c    f        8       20 mean
18:    c    f        1        1   sd
19:    d    f       11       23 mean
20:    d    f        1        1   sd

Onyambu · Accepted Answer · 2018-06-29 20:54:37Z


You can use outer with a vectorized function or use Map as shown below:

m = function(x,y)dt[, .(mean=mean(get(y)), sd=sd(get(y))), by=x]

c(outer(factorList,measureList,Vectorize(m)))

or

Map(m,rep(factorList,each=length(measureList)),measureList)
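
To see why the Map call covers every grouping/measure combination, here is a small base R sketch that only labels the pairs instead of summarising anything. The label format is arbitrary and chosen just for illustration: rep(..., each=) repeats each grouping once per measure, and Map recycles the shorter measureList alongside it.

```r
factorList  <- list(c("A"), c("B"), c("A", "B"))
measureList <- list(c("C"), c("D"))

# Label each (grouping, measure) pair that Map would pass to m().
pairs <- Map(
  function(x, y) paste(paste(x, collapse = "+"), y, sep = " ~ "),
  rep(factorList, each = length(measureList)),  # A, A, B, B, A+B, A+B
  measureList                                   # recycled: C, D, C, D, C, D
)

pair_labels <- unlist(pairs)
print(pair_labels)
# "A ~ C"   "A ~ D"   "B ~ C"   "B ~ D"   "A+B ~ C" "A+B ~ D"
```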

Edit: to carry the measure name into the column names:

m = function(x,y)setNames(dt[, .(mean(get(y)),sd(get(y))), by=x],
                          c(x,paste(c("mean","sd"),y,sep="_")))  # grouping columns are named by x itself

c(outer(factorList,measureList,Vectorize(m)))

edited Jun 29, 2018 at 20:54

answered Jun 29, 2018 at 0:24


1 Comment

Jose Over a year ago

Thanks! It works really well. I'm still new with functions but I was wondering whether it could be possible for the column names in each data frame created to reflect the element of 'measureList' that was computed. Thus, instead of having a general column named 'mean' or 'sd', it could say 'mean_C', 'sd_C' or 'mean_D', 'sd_D' according to each case. Otherwise, the result is a list of data frames all looking alike and hard to tell which comes from computing column C and which one from column D.
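
For the naming asked about in this comment, another option might be to summarise all measure columns in one pass per grouping with .SDcols and build the names with setNames. This is a rough sketch rather than part of the answer above: it assumes the measures are kept in a plain character vector (called measures here) instead of a list, and summarise_by is a hypothetical helper name.

```r
library(data.table)

dt <- data.table(A = rep(c("a", "b", "c", "d"), each = 3),
                 B = rep(c("e", "f"), each = 6),
                 C = 1:12,
                 D = 13:24)

factorList <- list(c("A"), c("B"), c("A", "B"))
measures   <- c("C", "D")  # assumption: character vector, not a list

# Hypothetical helper: one table per grouping, columns mean_C, mean_D, sd_C, sd_D.
summarise_by <- function(by_cols) {
  dt[, c(setNames(lapply(.SD, mean), paste0("mean_", measures)),
         setNames(lapply(.SD, sd),   paste0("sd_",   measures))),
     by = by_cols, .SDcols = measures]
}

out <- lapply(factorList, summarise_by)
# Name each result after its grouping: "A", "B", "A_B".
names(out) <- vapply(factorList, paste, character(1), collapse = "_")
print(out[["A_B"]])
```

Each element of out then says which measure every column came from, so the tables no longer look alike.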

zack · Accepted Answer · 2018-06-28 22:51:45Z


This uses dplyr and purrr, but I think it works.

library(dplyr)
library(purrr)

combos <- expand.grid(factorList, measureList)
map2(combos[, 1],
     combos[, 2],
     ~ dt %>% group_by_at(.x) %>% summarize_at(.y, list(mean = mean, sd = sd)))  # funs() is deprecated since dplyr 0.8

answered Jun 28, 2018 at 22:51

