r data.table functional programming / metaprogramming / computing on the language

Question

I am exploring different ways to wrap an aggregation function (though it could really be any type of function) using data.table (one dplyr example is also provided), and I am wondering about best practices for functional programming / metaprogramming with respect to:

performance (does the implementation matter with respect to potential optimization that data.table may apply)
readability (is there a commonly agreed standard e.g. in most packages utilizing data.table)
ease of generalization (are there differences in the way metaprogramming is "generalizable")

The basic application is to aggregate a table flexibly, i.e. parameterizing the variables to aggregate, the dimensions to aggregate by, the respective resulting variable names of both and the aggregation function. I have implemented (nearly) the same function in three data.table and one dplyr way:

fn_dt_agg1 (here I couldn't figure out how to parameterize the aggregation function)
fn_dt_agg2 (inspired by @jangorecki 's answer here which he calls "computing on the language")
fn_dt_agg3 (inspired by @Arun 's answer here which seems to be another approach of metaprogramming)
fn_df_agg1 (my humble approach of the same in dplyr)

libraries

library(data.table)
library(dplyr)

data

n_size <- 1*10^6
# spell out replace = TRUE instead of relying on partial matching of "rep"
sample_metrics <- sample(seq(from = 1, to = 100, by = 1), n_size, replace = TRUE)
sample_dimensions <- sample(letters[10:12], n_size, replace = TRUE)
df <- 
  data.frame(
    a = sample_metrics,
    b = sample_metrics,
    c = sample_dimensions,
    d = sample_dimensions,
    x = sample_metrics,
    y = sample_dimensions,
    stringsAsFactors = F)

dt <- as.data.table(df)

implementations

1. fn_dt_agg1

fn_dt_agg1 <- 
  function(dt, metric, metric_name, dimension, dimension_name) {

  temp <- dt[, setNames(lapply(.SD, function(x) {sum(x, na.rm = T)}), 
                        metric_name), 
             keyby = dimension, .SDcols = metric]
  temp[]
}

res_dt1 <- 
  fn_dt_agg1(
    dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"))

2. fn_dt_agg2

fn_dt_agg2 <- 
  function(dt, metric, metric_name, dimension, dimension_name,
           agg_type) {

  j_call = as.call(c(
    as.name("."),
    sapply(setNames(metric, metric_name), 
           function(var) as.call(list(as.name(agg_type), 
                                      as.name(var), na.rm = T)), 
           simplify = F)
    ))

  dt[, eval(j_call), keyby = dimension][]
}

res_dt2 <- 
  fn_dt_agg2(
    dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"),
    agg_type = c("sum"))

all.equal(res_dt1, res_dt2)
#TRUE
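
To make the "computing on the language" step in fn_dt_agg2 easier to inspect, the j expression can be built and printed outside of data.table. This is only a minimal sketch in base R; the output names sum_a and sum_b and the toy vector are made up for illustration.

```r
metric <- c("a", "b")
metric_name <- c("sum_a", "sum_b")  # hypothetical output names
agg_type <- "sum"

# Same construction as in fn_dt_agg2: one call per metric, wrapped in .()
j_call <- as.call(c(
  as.name("."),
  sapply(setNames(metric, metric_name),
         function(var) as.call(list(as.name(agg_type),
                                    as.name(var), na.rm = TRUE)),
         simplify = FALSE)
))

# Printing shows an unevaluated call of the form .(sum_a = sum(a, ...), ...)
print(j_call)

# Each element is an ordinary call that can be evaluated against any data
first_result <- eval(j_call[[2]], list(a = c(1, NA, 2)))
```

Because j_call is a plain language object, it can be printed, stored or modified before data.table ever sees it, which is what makes this approach comparatively easy to debug.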

3. fn_dt_agg3

fn_dt_agg3 <- 
  function(dt, metric, metric_name, dimension, dimension_name, agg_type) {

  e <- eval(parse(text=paste0("function(x) {", 
                              agg_type, "(", "x, na.rm = T)}"))) 

  temp <- dt[, setNames(lapply(.SD, e), 
                        metric_name), 
             keyby = dimension, .SDcols = metric]
  temp[]
}

res_dt3 <- 
  fn_dt_agg3(
    dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"), 
    agg_type = "sum")

all.equal(res_dt1, res_dt3)
#TRUE

4. fn_df_agg1

fn_df_agg1 <-
  function(df, metric, metric_name, dimension, dimension_name, agg_type) {

    all_vars <- c(dimension, metric)
    all_vars_new <- c(dimension_name, metric_name)
    dots_group <- lapply(dimension, as.name)

    e <- eval(parse(text=paste0("function(x) {", 
                                agg_type, "(", "x, na.rm = T)}")))

    df %>%
      select_(.dots = all_vars) %>%
      group_by_(.dots = dots_group) %>%
      summarise_each_(funs(e), metric) %>%
      rename_(.dots = setNames(all_vars, all_vars_new))
}

res_df1 <- 
  fn_df_agg1(
    df = df, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"),
    agg_type = "sum")

all.equal(res_dt1, as.data.table(res_df1))
#"Datasets has different keys. 'target': c, d. 'current' has no key."
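
The message above only concerns the key attribute, not the values. A small sketch of two ways to compare the contents anyway, assuming toy tables that are invented here: either skip the attribute check or set the same key on the unkeyed table first.

```r
library(data.table)

# Made-up tables with identical values; only the first one is keyed
keyed   <- data.table(c = c("j", "k"), a = c(3, 7), key = "c")
unkeyed <- data.table(c = c("j", "k"), a = c(3, 7))

cmp_default <- all.equal(keyed, unkeyed)                            # reports the key difference
cmp_values  <- all.equal(keyed, unkeyed, check.attributes = FALSE)  # ignores keys

# Alternatively, give both tables the same key before comparing
setkeyv(unkeyed, "c")
cmp_keyed <- all.equal(keyed, unkeyed)
```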

benchmarking

Just out of curiosity, and for my future self and other interested parties, I ran a benchmark of all 4 implementations, which potentially already sheds light on the performance issue (although I'm not a benchmarking expert, so please excuse me if I haven't applied commonly agreed best practices). I was expecting fn_dt_agg1 to be the fastest as it has one parameter fewer (the aggregation function), but that doesn't seem to have a sizable impact. I was also surprised by the relatively slow dplyr function, but this may be due to a bad design choice on my end.

library(microbenchmark)
bench_res <- 
  microbenchmark(
    fn_dt_agg1 = 
      fn_dt_agg1(
      dt = dt, metric = c("a", "b"), 
      metric_name = c("a", "b"), 
      dimension = c("c", "d"), 
      dimension_name = c("c", "d")), 
    fn_dt_agg2 = 
      fn_dt_agg2(
        dt = dt, metric = c("a", "b"), 
        metric_name = c("a", "b"), 
        dimension = c("c", "d"), 
        dimension_name = c("c", "d"),
        agg_type = c("sum")),
    fn_dt_agg3 =
      fn_dt_agg3(
        dt = dt, metric = c("a", "b"), 
        metric_name = c("a", "b"),
        dimension = c("c", "d"), 
        dimension_name = c("c", "d"),
        agg_type = c("sum")),
    fn_df_agg1 =
      fn_df_agg1(
        df = df, metric = c("a", "b"), metric_name = c("a", "b"),
        dimension = c("c", "d"), dimension_name = c("c", "d"),
        agg_type = "sum"),
    times = 100L)

bench_res

# Unit: milliseconds
#       expr      min       lq     mean   median       uq       max neval
# fn_dt_agg1 28.96324 30.49507 35.60988 32.62860 37.43578 140.32975   100
# fn_dt_agg2 27.51993 28.41329 31.80023 28.93523 33.17064  84.56375   100
# fn_dt_agg3 25.46765 26.04711 30.11860 26.64817 30.28980 153.09715   100
# fn_df_agg1 88.33516 90.23776 97.84826 94.28843 97.97154 172.87838   100

other resources

re: agg2 "which he calls 'computing on the language'" - not me but the official R language definition, which you linked at the bottom.
– jangorecki, Commented Dec 29, 2016 at 17:39
@Triamus You may check data.table v1.14.1devel, item 10: "A new interface for programming on data.table has been added"
– Henrik, Commented Jul 31, 2021 at 5:22
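
For reference, a rough sketch of what the interface mentioned in the last comment can look like. It assumes a data.table release that ships the env argument of `[.data.table` (to my knowledge 1.15.0 or later); the wrapper name fn_dt_agg_env, the placeholder names out, f, v and g, and the toy data are all hypothetical, and only a single metric and dimension are handled.

```r
library(data.table)

# Toy data; column names mirror the question, values are made up
dt_small <- data.table(a = c(1, 2, NA, 4), c = c("j", "j", "k", "k"))

# Hypothetical single-metric wrapper: the placeholders out, f, v and g in the
# query are substituted with the values supplied through env before evaluation
fn_dt_agg_env <- function(dt, metric, metric_name, dimension, agg_type) {
  dt[, .(out = f(v, na.rm = TRUE)), keyby = g,
     env = list(out = metric_name, f = agg_type, v = metric, g = dimension)]
}

res_env <- fn_dt_agg_env(dt_small, metric = "a", metric_name = "sum_a",
                         dimension = "c", agg_type = "sum")
```

The appeal is that the query stays readable as ordinary data.table code while the column and function names come in as plain character values.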

Roland · Accepted Answer · 2016-12-29 09:15:28Z


I don't recommend eval(parse()). You can achieve the same as in approach three without it:

fn_dt_agg4 <- 
  function(dt, metric, metric_name, dimension, dimension_name, agg_type) {

    e <- function(x) getFunction(agg_type)(x, na.rm = T)

    temp <- dt[, setNames(lapply(.SD, e), 
                          metric_name), 
               keyby = dimension, .SDcols = metric]
    temp[]
  }

This also avoids some security risks.
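
To illustrate the security point, here is a small sketch that contrasts the two ways of turning agg_type into a function. The builder names and the "hostile" input are made up for the example, and the side effect is a harmless counter; with eval(parse()) any code pasted into the string gets executed, whereas a lookup by name can only ever return an existing function.

```r
# Builder in the style of approach three: the string becomes source code
build_with_parse <- function(agg_type) {
  eval(parse(text = paste0("function(x) {", agg_type, "(x, na.rm = TRUE)}")))
}

# Lookup-based builder: the string may only name an existing function
build_with_lookup <- function(agg_type) {
  f <- methods::getFunction(agg_type)  # resolved once, not once per group
  function(x) f(x, na.rm = TRUE)
}

hits <- 0
hostile <- "{hits <<- hits + 1; sum}"  # made-up input: side effect, then sum

e_parse <- build_with_parse(hostile)
parse_result <- e_parse(c(1, NA, 2))   # the smuggled side effect runs here

# The lookup refuses the same input because no function has that name
lookup_error <- tryCatch(build_with_lookup(hostile), error = function(err) err)
```

Resolving the function once in the builder, rather than inside the function handed to lapply, also avoids repeating the lookup for every column in every group.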

PS: You can check what data.table is doing regarding optimizations by setting options("datatable.verbose" = TRUE).
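
A minimal sketch of how that check could be run, using a made-up four-row table. I would expect the verbose log to show whether data.table's internal grouping optimization was applied to the direct call and to the anonymous-function wrapper, but the exact messages depend on the installed version, so read the output rather than take my word for it.

```r
library(data.table)

dt_small <- data.table(a = c(1, 2, 3, 4), c = c("j", "j", "k", "k"))

old_opts <- options(datatable.verbose = TRUE)  # switch the diagnostics on

# Direct call to sum(): a candidate for internal optimization
res_direct <- dt_small[, .(a = sum(a, na.rm = TRUE)), keyby = c]

# Same aggregation through an anonymous wrapper, as in fn_dt_agg1
res_wrapped <- dt_small[, lapply(.SD, function(x) sum(x, na.rm = TRUE)),
                        keyby = c, .SDcols = "a"]

options(old_opts)  # restore the previous setting
```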

edited Dec 29, 2016 at 9:15

answered Dec 29, 2016 at 9:10


7 Comments

Axeman Over a year ago

Is there an important difference between getFunction and match.fun?

Triamus Over a year ago

Nice. I didn't know about getFunction and haven't seen it anywhere else so far. But why would eval(parse()) not be recommended? I had seen it in other answers from @Matt Dowle here and @Arun here.

Roland Over a year ago

@Axeman I don't know. The latter allows input other than characters.
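
A short sketch of that difference as I understand it: match.fun accepts either a function name or a function object, so a wrapper built on it lets callers pass sum, "sum" or an anonymous function. The helper agg below is hypothetical.

```r
f_chr <- match.fun("sum")              # looked up by name
f_obj <- match.fun(sum)                # a function object is passed through
f_get <- methods::getFunction("sum")   # name lookup, as in the answer

same_function <- identical(f_chr, f_obj) && identical(f_chr, f_get)

# Hypothetical helper that accepts either form of agg_fun
agg <- function(x, agg_fun) {
  f <- match.fun(agg_fun)
  f(x, na.rm = TRUE)
}

by_name   <- agg(c(1, NA, 2), "sum")
by_object <- agg(c(1, NA, 2), max)
by_lambda <- agg(c(1, NA, 2), function(x, na.rm) mean(x, na.rm = na.rm))
```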

Roland Over a year ago

@Triam In the first post it isn't from Matt, in the second post Arun refers to the question which uses it. R allows computing on the language, so you don't need it. Eval/parsing arbitrary expressions adds security risks, can be slow (not in the example here), and is impossible to debug.

Roland Over a year ago

Yes, eval/parsing is not computing on the language. If you consider using parse, reread fortune 106. There are valid uses of parse, but they are rare.
