Recursive assignment in data.table

Question

Is it possible in data.table to perform recursive assignment of multiple columns? By recursive I mean that the next assignment depends on the previous assignment:

library(data.table)
DT = data.table(id=rep(LETTERS[1:4], each=2), val=1:8)
DT[, c("cumsum", "cumsumofcumsum"):=list(cumsum(val), cumsum(cumsum)), by=id]

# Error in `[.data.table`(DT, , `:=`(c("cumsum", "cumsumofcumsum"), list(cumsum(val),  : 
#   cannot coerce type 'builtin' to vector of type 'double'

Of course, one can do the assignments individually, but I guess the overhead cost (e.g. grouping) wouldn't be shared among the operations:

DT = data.table(id=rep(LETTERS[1:4], each=2), val=1:8)
DT[, c("cumsum"):=cumsum(val), by=id]
DT[, c("cumsumofcumsum"):=cumsum(cumsum), by=id]
DT
#    id val cumsum cumsumofcumsum
# 1:  A   1      1              1
# 2:  A   2      3              4
# 3:  B   3      3              3
# 4:  B   4      7             10
# 5:  C   5      5              5
# 6:  C   6     11             16
# 7:  D   7      7              7
# 8:  D   8     15             22

agstudy · Accepted Answer · 2014-10-04 12:05:19Z

6

You can use a temporary variable and use it again for others variables:

DT[, c("cumsum", "cumsumofcumsum"):={
              x <- cumsum(val)
              list(x, cumsum(x))
              }, by=id]

Of course you can use dplyr and use your data.table as a backend, but I am not sure that you will get the same performance as the pure data.table method:

library(dplyr)
DT %>%
  group_by(id ) %>%
  mutate(
       cum1 = cumsum(val),
       cum2 = cumsum(cum1)
)

EDIT add some benchamrks:

Pure data.table solution is 5 times faster than dplyr one. I guess the sort in dplyr behind the scene can explain this difference.

f_dt <- 
  function(){
DT[, c("cumsum", "cumsumofcumsum"):={
  x <- as.numeric(cumsum(val))
  list(x, cumsum(x))
}, by=id]
}

f_dplyr <- 
  function(){
DT %>%
  group_by(id ) %>%
  mutate(
       cum1 = as.numeric(cumsum(val)),
       cum2 = cumsum(cum1)
)
}


library(microbenchmark)

microbenchmark(f_dt(),f_dplyr(),times = 100)
    expr       min       lq    median        uq       max neval
    f_dt()  2.580121  2.97114  3.256156  4.318658  13.49149   100
 f_dplyr() 10.792662 14.09490 15.909856 19.593819 159.80626   100

edited Oct 4, 2014 at 12:05

answered Oct 4, 2014 at 11:18

agstudy

122k18 gold badges205 silver badges265 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Daniel Krizian Over a year ago

Thanks. Do you also get warnings: "In [.data.table(DT, , :=(c("cumsum", "cumsumofcumsum"), ... : integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'" when going for larger dataset? DT = data.table(id=rep(LETTERS[1:20], each=1000), val=1:20000)

Arun Over a year ago

Also note that dplyr sets key (instead of using adhoc-by) on both summarise and mutate, which'll always result in sorted output.

agstudy Over a year ago

@DanielKrizian I edit the answer and add benchmarking.

Arun Over a year ago

Also #614 when fixed will improve performance.

Collectives™ on Stack Overflow

Recursive assignment in data.table

1 Answer 1

EDIT add some benchamrks:

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

EDIT add some benchamrks:

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related