7

I am exploring different ways to wrap an aggregation function (but really it could be any type of function) using data.table (one dplyr example is also provided) and was wondering on best practices for functional programming / metaprogramming with respect to

  • performance (does the implementation matter with respect to potential optimization that data.table may apply)
  • readability (is there a commonly agreed standard e.g. in most packages utilizing data.table)
  • ease of generalization (are there differences in the way metaprogramming is "generalizable")

The basic application is to aggregate a table flexibly, i.e. parameterizing the variables to aggregate, the dimensions to aggregate by, the respective resulting variable names of both and the aggregation function. I have implemented (nearly) the same function in three data.table and one dplyr way:

  1. fn_dt_agg1 (here I couldn't figure out how parameterize the aggregation function)
  2. fn_dt_agg2 (inspired by @jangorecki 's answer here which he calls "computing on the language")
  3. fn_dt_agg3 (inspired by @Arun 's answer here which seems to be another approach of metaprogramming)
  4. fn_df_agg1 (my humble approach of the same in dplyr)

libraries

library(data.table)
library(dplyr)

data

n_size <- 1*10^6
sample_metrics <- sample(seq(from = 1, to = 100, by = 1), n_size, rep = T)
sample_dimensions <- sample(letters[10:12], n_size, rep = T)
df <- 
  data.frame(
    a = sample_metrics,
    b = sample_metrics,
    c = sample_dimensions,
    d = sample_dimensions,
    x = sample_metrics,
    y = sample_dimensions,
    stringsAsFactors = F)

dt <- as.data.table(df)

implementations

1. fn_dt_agg1

fn_dt_agg1 <- 
  function(dt, metric, metric_name, dimension, dimension_name) {

  temp <- dt[, setNames(lapply(.SD, function(x) {sum(x, na.rm = T)}), 
                        metric_name), 
             keyby = dimension, .SDcols = metric]
  temp[]
}

res_dt1 <- 
  fn_dt_agg1(
    dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"))

2. fn_dt_agg2

fn_dt_agg2 <- 
  function(dt, metric, metric_name, dimension, dimension_name,
           agg_type) {

  j_call = as.call(c(
    as.name("."),
    sapply(setNames(metric, metric_name), 
           function(var) as.call(list(as.name(agg_type), 
                                      as.name(var), na.rm = T)), 
           simplify = F)
    ))

  dt[, eval(j_call), keyby = dimension][]
}

res_dt2 <- 
  fn_dt_agg2(
    dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"),
    agg_type = c("sum"))

all.equal(res_dt1, res_dt2)
#TRUE

3. fn_dt_agg3

fn_dt_agg3 <- 
  function(dt, metric, metric_name, dimension, dimension_name, agg_type) {

  e <- eval(parse(text=paste0("function(x) {", 
                              agg_type, "(", "x, na.rm = T)}"))) 

  temp <- dt[, setNames(lapply(.SD, e), 
                        metric_name), 
             keyby = dimension, .SDcols = metric]
  temp[]
}

res_dt3 <- 
  fn_dt_agg3(
    dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"), 
    agg_type = "sum")

all.equal(res_dt1, res_dt3)
#TRUE

4. fn_df_agg1

fn_df_agg1 <-
  function(df, metric, metric_name, dimension, dimension_name, agg_type) {

    all_vars <- c(dimension, metric)
    all_vars_new <- c(dimension_name, metric_name)
    dots_group <- lapply(dimension, as.name)

    e <- eval(parse(text=paste0("function(x) {", 
                                agg_type, "(", "x, na.rm = T)}")))

    df %>%
      select_(.dots = all_vars) %>%
      group_by_(.dots = dots_group) %>%
      summarise_each_(funs(e), metric) %>%
      rename_(.dots = setNames(all_vars, all_vars_new))
}

res_df1 <- 
  fn_df_agg1(
    df = df, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"),
    agg_type = "sum")

all.equal(res_dt1, as.data.table(res_df1))
#"Datasets has different keys. 'target': c, d. 'current' has no key."

benchmarking

Just out of curiosity and for my future self and other interested parties, I ran a benchmark of all 4 implementations which potentially already sheds light on the performance issue (although I'm not a benchmarking expert so please excuse if I haven't applied commonly agreed best practices). I was expecting fn_dt_agg1 to be the fastest as it has one parameter less (aggregation function) but that doesn't seem to have a sizable impact. I was also surprised by the relatively slow dplyr function but this may be due to a bad design choice on my end.

library(microbenchmark)
bench_res <- 
  microbenchmark(
    fn_dt_agg1 = 
      fn_dt_agg1(
      dt = dt, metric = c("a", "b"), 
      metric_name = c("a", "b"), 
      dimension = c("c", "d"), 
      dimension_name = c("c", "d")), 
    fn_dt_agg2 = 
      fn_dt_agg2(
        dt = dt, metric = c("a", "b"), 
        metric_name = c("a", "b"), 
        dimension = c("c", "d"), 
        dimension_name = c("c", "d"),
        agg_type = c("sum")),
    fn_dt_agg3 =
      fn_dt_agg3(
        dt = dt, metric = c("a", "b"), 
        metric_name = c("a", "b"),
        dimension = c("c", "d"), 
        dimension_name = c("c", "d"),
        agg_type = c("sum")),
    fn_df_agg1 =
      fn_df_agg1(
        df = df, metric = c("a", "b"), metric_name = c("a", "b"),
        dimension = c("c", "d"), dimension_name = c("c", "d"),
        agg_type = "sum"),
    times = 100L)

bench_res

# Unit: milliseconds
#       expr      min       lq     mean   median       uq       max neval
# fn_dt_agg1 28.96324 30.49507 35.60988 32.62860 37.43578 140.32975   100
# fn_dt_agg2 27.51993 28.41329 31.80023 28.93523 33.17064  84.56375   100
# fn_dt_agg3 25.46765 26.04711 30.11860 26.64817 30.28980 153.09715   100
# fn_df_agg1 88.33516 90.23776 97.84826 94.28843 97.97154 172.87838   100

other resources

2
  • re: agg2 "which he calls 'computing on the language'" - not me but official R lang definition which you linked at the bottom. Commented Dec 29, 2016 at 17:39
  • 2
    @Triamus You may check data.table v1.14.1devel, item 10: "A new interface for programming on data.table has been added" Commented Jul 31, 2021 at 5:22

1 Answer 1

6

I don't recommend eval(parse()). You can achieve the same as in approach three without it:

fn_dt_agg4 <- 
  function(dt, metric, metric_name, dimension, dimension_name, agg_type) {

    e <- function(x) getFunction(agg_type)(x, na.rm = T)

    temp <- dt[, setNames(lapply(.SD, e), 
                          metric_name), 
               keyby = dimension, .SDcols = metric]
    temp[]
  }

This also avoids some security risks.

PS: You can check what data.table is doing regarding optimizations by setting options("datatable.verbose" = TRUE).

Sign up to request clarification or add additional context in comments.

7 Comments

Is there an important difference between getFunction and match.fun?
nice. I didn't know about getFunction. haven't seen it anywhere else so far. but why would eval(parse)) not be recommended? I had seen it in other answers from @Matt Dowle here and @Arun here
@Axeman I don't know. The latter allows input other than characters.
@Triam In the first post it isn't from Matt, in the second post Arun refers to the question which uses it. R allows computing on the language, so you don't need it. Eval/parsing arbitrary expressions adds security risks, can be slow (not in the example here), and is impossible to debug.
Yes, eval/parsing is not computing on the language. If you consider using parse reread fortune 106. There are valid uses of parse, but they are rare.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.