
I've read many posts on passing column names to a data.table function, but I did not see a post dealing with passing multiple variables to "by". I commonly use code like this to calculate summary statistics by group.

# Data
library(data.table)
dt=mtcars
setDT(dt)

# Summary Stats Example
dt[cyl==4,.(Count=.N,
    Mean=mean(hp),
    Median=median(hp)),
    by=.(am,vs)]

#    am vs Count   Mean Median
# 1:  1  1     7 80.571     66
# 2:  0  1     3 84.667     95
# 3:  1  0     1 91.000     91

I can't get the following function to work:

# Function
myFun <- function(df,i,j,by){
    df[i==4,.(Count=.N,
      Mean=mean(j),
      Median=median(j)),
      by=.(am,by)]
}
myFun(dt,i='cyl',j='hp',by='vs')

Note that I hard-coded "4" and "am" into the function for this example. get() worked with a single grouping variable in by, but failed once I used multiple grouping variables. Guidance on how to properly use get/quote/eval/substitute/parse/as.name/etc. when writing data.table functions is appreciated.

2 Answers

Pass a character vector of column names to the by argument of data.table and it will work:

myFun <- function(df, i, j, by){

 df[get(i) == 4, .(Count = .N, 
           Mean = mean(get(j)),
           Median = median(get(j))),
  by = c(by, 'am')]
}



myFun(dt, i = 'cyl', j = 'hp', by = 'vs')

#vs am Count     Mean Median
#1:  1  1     7 80.57143     66
#2:  1  0     3 84.66667     95
#3:  0  1     1 91.00000     91
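The same idea extends to any number of grouping columns. The following is a minimal sketch, not a definitive implementation: the helper name summarise_by and its argument names (filter_col, filter_val, value_col, by_cols) are hypothetical, chosen so that none of them collides with a column name in mtcars.

```r
library(data.table)

# Work on a data.table copy of the built-in mtcars data
dt <- as.data.table(mtcars)

# Hypothetical helper: by_cols is a character vector of any length
summarise_by <- function(df, filter_col, filter_val, value_col, by_cols) {
  df[get(filter_col) == filter_val,            # row filter, column given as a string
     .(Count  = .N,
       Mean   = mean(get(value_col)),          # value column given as a string
       Median = median(get(value_col))),
     by = by_cols]                             # character vector of grouping columns
}

res <- summarise_by(dt, filter_col = "cyl", filter_val = 4,
                    value_col = "hp", by_cols = c("am", "vs"))
print(res)
```

Because by_cols is a plain character vector, callers can group by one column or several without changing the function body.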

3 Comments

eval(by) is not necessary.
Thanks @sm925 and @sindri_baldur. I noticed that the above code changed the by argument from a list to a vector. My example does not show it, but I typically apply criteria (eg grp>2) in the by argument, so for my general purposes I need to use by=.().
Your responses helped me craft the following cheat sheet:
  • Pass i, j, and by variables using get(var)
  • Pass i or by criteria directly
The above assumes by is a list. It may fail or be considered bad practice in more complicated scenarios. For example, I use merge() instead of taking advantage of [ to join two data.tables.

I've accepted sm925's answer. Below is a more complex example/solution that passes a list to the by argument:

# Libraries
library(data.table)

# Data
dt = mtcars
setDT(dt)

# Function to calculate summary statistics
myFun <- function(df, i1var, i1val, i2var, i2val,            # i arguments
                                    j,                       # j arguments
                                    by1var, by2var, by2val){ # by arguments
    df[get(i1var) == i1val & get(i2var) %in% i2val,
         .(Count = .N,
            Mean = mean(get(j)),
            Median = median(get(j))),
        by = .(get(by1var), get(by2var) == by2val)]
} # END Function

# Run function
myFun(dt,i1var = 'cyl', i1val = 4, i2var = 'gear', i2val = c(3,4),
            j = 'hp',
            by1var = 'vs', by2var = 'am', by2val = 1)
#    vs am Count     Mean Median
# 1:  1  1     6 75.16667     66
# 2:  1  0     3 84.66667     95

# Should match
dt[cyl == 4 & gear %in% c(3,4),
     .(Count = .N,
        Mean = mean(hp),
        Median = median(hp)),
     by = .(vs, am == 1)]
#    vs am Count     Mean Median
# 1:  1  1     6 75.16667     66
# 2:  1  0     3 84.66667     95

Here is my Cheat Sheet:

  • Pass i, j, and by variables using get(var)
  • Pass i or by criteria directly

The above may not apply to more complex functions, and may not be optimal.

If by is given as a character vector and NOT a list (e.g., by=c() instead of by=.()), then the grouping column names can be passed directly as strings, without get().
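When the by list is built from get() calls, the grouping columns in the result may not carry the names you expect. A sketch of one way to avoid relying on automatic naming is to name each by expression explicitly and rename afterwards. The function name grouped_stats, the placeholder names g1 and g2, and the "_is_" naming scheme below are all hypothetical choices for illustration.

```r
library(data.table)

dt <- as.data.table(mtcars)

# Hypothetical variant of the function above with explicitly named by expressions
grouped_stats <- function(df, i1var, i1val, i2var, i2val, j,
                          by1var, by2var, by2val) {
  res <- df[get(i1var) == i1val & get(i2var) %in% i2val,
            .(Count  = .N,
              Mean   = mean(get(j)),
              Median = median(get(j))),
            by = .(g1 = get(by1var),              # plain grouping column
                   g2 = get(by2var) == by2val)]   # grouping on a condition
  # Replace the placeholder names with names derived from the arguments
  setnames(res, c("g1", "g2"), c(by1var, paste0(by2var, "_is_", by2val)))
  res
}

out <- grouped_stats(dt, i1var = "cyl", i1val = 4,
                     i2var = "gear", i2val = c(3, 4),
                     j = "hp",
                     by1var = "vs", by2var = "am", by2val = 1)
print(out)
```

The second grouping column holds the logical result of the comparison, so giving it a distinct name keeps it from being mistaken for the original am column.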
