84

I have a function that returns two values in a list. Both values need to be added to a data.table in two new columns. Evaluation of the function is costly, so I would like to avoid having to compute the function twice. Here's the example:

library(data.table)
example(data.table)
DT
   x y  v
1: a 1 42
2: a 3 42
3: a 6 42
4: b 1  4
5: b 3  5
6: b 6  6
7: c 1  7
8: c 3  8
9: c 6  9

Here's an example of my function. Remember I said it's costly compute, on top of that there is no way to deduce one return value from the other given values (as in the example below):

myfun <- function (y, v) 
{
ret1 = y + v
ret2 = y - v
return(list(r1 = ret1, r2 = ret2))
}

Here's my way to add two columns in one statement. That one needs to call myfun twice, however:

DT[,new1:=myfun(y,v)$r1][,new2:=myfun(y,v)$r2]

   x y  v new1 new2
1: a 1 42   43  -41
2: a 3 42   45  -39
3: a 6 42   48  -36
4: b 1  4    5   -3
5: b 3  5    8   -2
6: b 6  6   12    0
7: c 1  7    8   -6
8: c 3  8   11   -5
9: c 6  9   15   -3

Any suggestions on how to do this? I could save r2 in a separate environment each time I call myfun, I just need a way to add two columns by reference at a time.

2
  • Why not have your function take in a data frame and return a data frame directly? `myfun <- function (y, v) { ret1 = y + v ret2 = y - v return(list(r1 = ret1, r2 = ret2)) } Commented Jul 4, 2012 at 18:55
  • 3
    @Etienne Because that copies the inputs to create a new output. Florian is using data.table for its memory efficiency with large datasets; it doesn't copy x,y or v at all, even once. Think 20GB datasets in RAM. Commented Jul 5, 2012 at 8:58

5 Answers 5

108

Since data.table v1.8.3, you can do this:

DT[, c("new1","new2") := myfun(y,v)]

Another option is storing the output of the function and adding the columns one-by-one:

z <- myfun(DT$y,DT$v)
head(DT[,new1:=z$r1][,new2:=z$r2])
#      x y  v new1 new2
# [1,] a 1 42   43  -41
# [2,] a 3 42   45  -39
# [3,] a 6 42   48  -36
# [4,] b 1  4    5   -3
# [5,] b 3  5    8   -2
# [6,] b 6  6   12    0
Sign up to request clarification or add additional context in comments.

9 Comments

wow, that second one is amazing, thanks! just ran it with debug(myfun) to see how often it gets called: it's once. great.
+10 from me too. I've just raised FR#2120 to "Drop needing with=FALSE for LHS of :="
Note that list recycling is also done; e.g., c("a","b","c","d"):=list(1,2) puts 1 into a and c, and 2 into b and d. If any of the columns don't exist they'll be added by reference. Not sure how useful := recycling is in practice. It's more for c("a","b","c"):=NULL which deletes those 3 columns. Internally that's a recycle of NULL to a (semantic) list length 3.
@MatthewDowle oh yes, just wanted to ask that. the c("a","b","c"):=NULL is very useful.
another useful := usage can be `:=`(colname=colvalue,...). I often prefer this one because you might just replace := with list to have a read-only preview of data to be written by reference when := used.
|
4

The answer can not be used such as when the function is not vectorized.

For example in the following situation it will not work as intended:

myfun <- function (y, v, g) 
{
  ret1 = y + v + length(g)
  ret2 = y - v + length(g)
  return(list(r1 = ret1, r2 = ret2))
}
DT
#    v y                  g
# 1: 1 1                  1
# 2: 1 3                4,2
# 3: 1 6              9,8,6

DT[,c("new1","new2"):=myfun(y,v,g)]
DT
#    v y     g new1 new2
# 1: 1 1     1    5    3
# 2: 1 3   4,2    7    5
# 3: 1 6 9,8,6   10    8

It will always add the size of column g, not the size of each vector in g

A solution in such case is:

DT[, c("new1","new2") := data.table(t(mapply(myfun,y,v,g)))]
DT
#    v y     g new1 new2
# 1: 1 1     1    3    1
# 2: 1 3   4,2    6    4
# 3: 1 6 9,8,6   10    8

Comments

2

To build on the previous answer, one can use lapply with a function that output more than one column. It's is then possible to use the function with more columns of the data.table.

 myfun <- function(a,b){
     res1 <- a+b
     res2 <- a-b
     list(res1,res2)
 }

 DT <- data.table(z=1:10,x=seq(3,30,3),t=seq(4,40,4))
 DT

 ## DT
 ##     z  x  t
 ## 1:  1  3  4
 ## 2:  2  6  8
 ## 3:  3  9 12
 ## 4:  4 12 16
 ## 5:  5 15 20
 ## 6:  6 18 24
 ## 7:  7 21 28
 ## 8:  8 24 32
 ## 9:  9 27 36
 ## 10: 10 30 40

 col <- colnames(DT)
 DT[, paste0(c('r1','r2'),rep(col,each=2)):=unlist(lapply(.SD,myfun,z),
                                                   recursive=FALSE),.SDcols=col]
 ## > DT
 ##     z  x  t r1z r2z r1x r2x r1t r2t
 ## 1:  1  3  4   2   0   4   2   5   3
 ## 2:  2  6  8   4   0   8   4  10   6
 ## 3:  3  9 12   6   0  12   6  15   9
 ## 4:  4 12 16   8   0  16   8  20  12
 ## 5:  5 15 20  10   0  20  10  25  15
 ## 6:  6 18 24  12   0  24  12  30  18
 ## 7:  7 21 28  14   0  28  14  35  21
 ## 8:  8 24 32  16   0  32  16  40  24
 ## 9:  9 27 36  18   0  36  18  45  27
 ## 10: 10 30 40  20   0  40  20  50  30

Comments

1

In case a function return a matrix you can achieve the same behavior by wrapping the function with one converting the matrix into list first. I wonder if data.table should handle it automatically?

matrix2list <- function(mat){
unlist(apply(mat,2,function(x) list(x)),FALSE)
}

DT <- data.table(A=1:10)

myfun <- function(x) matrix2list(cbind(x+1,x-1))

DT[,c("c","d"):=myfun(A)]

##>DT
##      A  c d
##  1:  1  2 0
##  2:  2  3 1
##  3:  3  4 2
##  4:  4  5 3
##  5:  5  6 4
##  6:  6  7 5
##  7:  7  8 6
##  8:  8  9 7
##  9:  9 10 8
## 10: 10 11 9

Comments

-5

Why not have your function take in a data frame and return a data frame directly?

myfun <- function (DT) 
{
DT$ret1 = with(DT, y + v)
DT$ret2 = with(DT, y - v)
return(DT)
}

1 Comment

Because that copies the whole of DT, twice. Florian is using data.table for its memory efficiency with large datasets; it doesn't copy x,y or v at all, even once.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.