How do I apply a function to row subsets of a data.table where each call returns a data.table

Question

Here's a data.table

library(data.table)
dt <- data.table(group = c("a","a","a","b","b","b"), x = c(1,3,5,1,3,5), y = c(3,5,8,2,8,9))
dt
   group x y
1:     a 1 3
2:     a 3 5
3:     a 5 8
4:     b 1 2
5:     b 3 8
6:     b 5 9

And here's a function that operates on a data.table and returns a data.table

myfunc <- function(dt){
  # Hyman spline interpolation (which preserves monotonicity)

  newdt <- data.table(x = seq(min(dt$x), max(dt$x)))
  newdt$y <- spline(x = dt$x, y = dt$y, xout = newdt$x, method = "hyman")$y
  return(newdt)
}

How do I apply myfunc to each subset of dt defined by the "group" column? In other words, I want an efficient, generalized way to do this:

result <- rbind(myfunc(dt[group=="a"]), myfunc(dt[group=="b"]))
result
    x     y
 1: 1 3.000
 2: 2 3.875
 3: 3 5.000
 4: 4 6.375
 5: 5 8.000
 6: 1 2.000
 7: 2 5.688
 8: 3 8.000
 9: 4 8.875
10: 5 9.000

EDIT: I've updated my sample dataset and myfunc because I think it was initially too simplistic and invited work-arounds to the actual problem I'm trying to solve.

Your function creates unnecessary copies. Just do dt[, .(x = seq(min(x), max(x) + 1), y = rep(y, each = 2)), by = group]
– David Arenburg, Commented Mar 31, 2015 at 21:09
Alternatively, define your function as follows: myfunc <- function(x, y){ list(x = seq(min(x), max(x)+1), y = rep(y, each=2))} and then do dt[, myfunc(x, y), by = group]
– David Arenburg, Commented Mar 31, 2015 at 21:12
@Ben, @DavidArenburg's comment still holds. Have your function return a list, not a data.table, and do dt[, myfunc(x, y), by = group].
– Frank, Commented Mar 31, 2015 at 21:19

Ben Bolker · Accepted Answer · 2015-04-05 13:05:06Z

7

The whole idea of data.table is to be both memory efficient and fast. Thus we avoid using $ within the data.table scope (except in very rare situations), and we don't create data.table objects within a data.table's environment (currently, even .SD has overhead).
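For contrast, here is a minimal sketch of the .SD form that the paragraph above advises against. It keeps a function that takes and returns a data.table, as in the question, so it should work but pays the per-group overhead of building .SD and a new data.table. The name myfunc_sd is introduced here only for illustration.

```r
library(data.table)

dt <- data.table(group = c("a","a","a","b","b","b"),
                 x = c(1,3,5,1,3,5), y = c(3,5,8,2,8,9))

# Hypothetical helper: same logic as the question's myfunc,
# taking a data.table and returning a new data.table.
myfunc_sd <- function(d) {
  xout <- seq(min(d$x), max(d$x))
  data.table(
    x = xout,
    y = spline(x = d$x, y = d$y, xout = xout, method = "hyman")$y
  )
}

# .SD holds the current group's rows (without the grouping column).
res_sd <- dt[, myfunc_sd(.SD), by = group]
```

The grouping column is added back to the result automatically, so the helper only needs to return x and y.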

In your case you can take advantage of data.table's non-standard evaluation capabilities and define your function as follows

myfunc <- function(x, y){
   temp = seq(min(x), max(x))
   y = spline(x = x, y = y, xout = temp, method = "hyman")$y
   list(x = temp, y = y)
}

Then the implementation within the dt scope is straightforward

dt[, myfunc(x, y), by = group]
#     group x      y
#  1:     a 1 3.0000
#  2:     a 2 3.8750
#  3:     a 3 5.0000
#  4:     a 4 6.3750
#  5:     a 5 8.0000
#  6:     b 1 2.0000
#  7:     b 2 5.6875
#  8:     b 3 8.0000
#  9:     b 4 8.8750
# 10:     b 5 9.0000
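As a sanity check, the grouped call can be compared against the manual per-group approach from the question. This is only a sketch: the names grouped and manual are introduced here for illustration, and rbindlist stands in for the question's rbind.

```r
library(data.table)

dt <- data.table(group = c("a","a","a","b","b","b"),
                 x = c(1,3,5,1,3,5), y = c(3,5,8,2,8,9))

# List-returning version from the answer.
myfunc <- function(x, y) {
  temp <- seq(min(x), max(x))
  list(x = temp, y = spline(x = x, y = y, xout = temp, method = "hyman")$y)
}

# One grouped call ...
grouped <- dt[, myfunc(x, y), by = group]

# ... versus filtering each group by hand and stacking the pieces.
manual <- rbindlist(list(
  dt[group == "a", myfunc(x, y)],
  dt[group == "b", myfunc(x, y)]
))

# Should be TRUE if both routes produce the same x and y columns.
all.equal(grouped[, .(x, y)], manual)
```

The grouped form also scales to any number of groups without listing them, which is the generalization the question asks for.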

edited Apr 5, 2015 at 13:05

Ben Bolker


answered Mar 31, 2015 at 21:31

David Arenburg


1 Comment

Frank Over a year ago

NSE is "non-standard evaluation", eh? So suggests google, anyway.
