Is it possible in data.table to perform recursive assignment of multiple columns? By recursive I mean that the next assignment depends on the previous assignment:

library(data.table)
DT = data.table(id=rep(LETTERS[1:4], each=2), val=1:8)
DT[, c("cumsum", "cumsumofcumsum"):=list(cumsum(val), cumsum(cumsum)), by=id]

# Error in `[.data.table`(DT, , `:=`(c("cumsum", "cumsumofcumsum"), list(cumsum(val),  : 
#   cannot coerce type 'builtin' to vector of type 'double'

Of course, one can do the assignments individually, but I guess the overhead cost (e.g. grouping) wouldn't be shared among the operations:

DT = data.table(id=rep(LETTERS[1:4], each=2), val=1:8)
DT[, c("cumsum"):=cumsum(val), by=id]
DT[, c("cumsumofcumsum"):=cumsum(cumsum), by=id]
DT
#    id val cumsum cumsumofcumsum
# 1:  A   1      1              1
# 2:  A   2      3              4
# 3:  B   3      3              3
# 4:  B   4      7             10
# 5:  C   5      5              5
# 6:  C   6     11             16
# 7:  D   7      7              7
# 8:  D   8     15             22

1 Answer

You can store the first result in a temporary variable inside the j expression and reuse it for the other columns:

DT[, c("cumsum", "cumsumofcumsum"):={
              x <- cumsum(val)
              list(x, cumsum(x))
              }, by=id]
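As a quick check, the following sketch compares the single grouped call against the two separate assignments from the question. The names one_pass and two_pass are illustrative only and are not part of the original post; copy() is used so that neither variant modifies DT by reference.

```r
library(data.table)

DT <- data.table(id = rep(LETTERS[1:4], each = 2), val = 1:8)

# One grouped pass: the braces are evaluated as an ordinary R block per group,
# so the local variable x can be reused before the list is returned.
one_pass <- copy(DT)
one_pass[, c("cumsum", "cumsumofcumsum") := {
  x <- cumsum(val)
  list(x, cumsum(x))
}, by = id]

# Two grouped passes, as in the question.
two_pass <- copy(DT)
two_pass[, c("cumsum") := cumsum(val), by = id]
two_pass[, c("cumsumofcumsum") := cumsum(cumsum), by = id]

# Both variants should produce the same table.
all.equal(one_pass, two_pass)
```

The temporary variable x exists only inside the braces; it is not added to the table.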

You can also use dplyr with your data.table as the backend, though I am not sure it will match the performance of the pure data.table approach:

library(dplyr)
DT %>%
  group_by(id ) %>%
  mutate(
       cum1 = cumsum(val),
       cum2 = cumsum(cum1)
)

EDIT: added some benchmarks.

The pure data.table solution is about 5 times faster than the dplyr one (median 3.26 vs. 15.91 in the timings below). I suspect the sorting that dplyr does behind the scenes explains the difference.

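# data.table version: a single grouped call, reusing the temporary x
f_dt <- function() {
  DT[, c("cumsum", "cumsumofcumsum") := {
    x <- as.numeric(cumsum(val))
    list(x, cumsum(x))
  }, by = id]
}

# dplyr version: grouped mutate, where cum2 builds on cum1
f_dplyr <- function() {
  DT %>%
    group_by(id) %>%
    mutate(
      cum1 = as.numeric(cumsum(val)),
      cum2 = cumsum(cum1)
    )
}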


library(microbenchmark)

microbenchmark(f_dt(),f_dplyr(),times = 100)
    expr       min       lq    median        uq       max neval
    f_dt()  2.580121  2.97114  3.256156  4.318658  13.49149   100
 f_dplyr() 10.792662 14.09490 15.909856 19.593819 159.80626   100
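
The as.numeric() calls in f_dt and f_dplyr above are there because cumsum on integer input can overflow once the running total exceeds the integer range. A minimal base R sketch of that behaviour, using made-up values rather than the benchmark data (the names x, res_int, and res_num are illustrative):

```r
# Two integers whose running total exceeds .Machine$integer.max
x <- c(.Machine$integer.max, 1L)

# Integer cumsum: the overflowing element becomes NA and R emits a warning,
# which is suppressed here to keep the example quiet.
res_int <- suppressWarnings(cumsum(x))

# Converting to double first avoids the overflow.
res_num <- cumsum(as.numeric(x))

is.na(res_int[2])   # the integer result is NA at the overflow point
res_num[2]          # the double result holds the full total
```

Note that the conversion changes the column type from integer to double.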

4 Comments

Thanks. Do you also get this warning on a larger dataset: "In `[.data.table`(DT, , `:=`(c("cumsum", "cumsumofcumsum"), ... : integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'"? For example: DT = data.table(id=rep(LETTERS[1:20], each=1000), val=1:20000)
Also note that dplyr sets key (instead of using adhoc-by) on both summarise and mutate, which'll always result in sorted output.
@DanielKrizian I edited the answer and added a benchmark.
Also #614 when fixed will improve performance.
