update data.table based on values in another data.table

Question

I'm attempting to update a data.table based on values in another table. I thought I had an approach (although not a particularly efficient one, given that my actual dt2 has 3.5M records), but it turns out something is going wrong with my code.

In dt1 I have a number of numeric variables that I want to set to 1 or 0 depending on the classification value.

Example of data:

library(data.table)
dt1 <- data.table(urn = 1:10, V1 = 0, V2 = 0, V3 = 0)
##    urn V1 V2 V3
## 1:   1  0  0  0
## 2:   2  0  0  0
## 3:   3  0  0  0
## 4:   4  0  0  0
## 5:   5  0  0  0
## 6:   6  0  0  0
## 7:   7  0  0  0
## 8:   8  0  0  0
## 9:   9  0  0  0
##10:  10  0  0  0

dt2 <- data.table(urn = rep(1:10, 2), classification = rep(1:7, length.out = 20))
##    urn classification
## 1:   1              1
## 2:   2              2
## 3:   3              3
## 4:   4              4
## 5:   5              5
## 6:   6              6
## 7:   7              7
## 8:   8              1
## 9:   9              2
##10:  10              3
##11:   1              4
##12:   2              5
##13:   3              6
##14:   4              7
##15:   5              1
##16:   6              2
##17:   7              3
##18:   8              4
##19:   9              5
##20:  10              6

What I want to happen is where the urn matches, and the urn has a record with classification of 1, then set the V1 flag to 1; where the urn has a classification of 2, set V2 = 1; you get the idea.

My current approach has been:

setkey(dt1, urn)
setkey(dt2, urn)
dt1[dt2, V1:= ifelse(i.classification == 1 , 1, dt1$V1)]
dt1[dt2, V2:= ifelse(i.classification == 2, 1, dt1$V2)]
dt1[dt2, V3:= ifelse(i.classification == 6, 1, dt1$V3)]
dt1
##    urn V1 V2 V3
## 1:   1  1  0  0
## 2:   2  1  0  0
## 3:   3  1  0  1
## 4:   4  1  0  0
## 5:   5  1  0  0
## 6:   6  1  1  0
## 7:   7  1  0  0
## 8:   8  1  0  0
## 9:   9  1  0  0
##10:  10  1  0  1

What I actually expect the output to look like is:

##    urn V1 V2 V3
## 1:   1  1  0  0
## 2:   2  0  1  0
## 3:   3  0  0  1
## 4:   4  0  0  0
## 5:   5  1  0  0
## 6:   6  0  1  1
## 7:   7  0  0  0
## 8:   8  1  0  0
## 9:   9  0  1  0
##10:  10  0  0  1

Any help would be greatly appreciated, particularly if I can simplify the command to handle all of these in a single scan rather than three joins of the data. As I said, 3.5M records add a bit of overhead (particularly as I'll actually be setting flags for about 10 columns).

Thanks in advance.

Something like dcast(dt1[dt2,], urn~classification) might work.
– shadow, commented Sep 29, 2014 at 10:35

shadow · Accepted Answer · 2014-10-01 08:29:45Z

You can use dcast.data.table for this.

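# Assumes dt1 and dt2 are keyed by urn (setkey), as in the question
dcast.data.table(data = dt1[dt2,],
                 formula = urn ~ classification,
                 # 1 if the urn has any record with this classification, else 0
                 fun.aggregate = function(x) as.numeric(any(!is.na(x))),
                 value.var = 'classification')[, list(urn, V1=`1`, V2=`2`, V3=`6`)]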
##     urn V1 V2 V3
##  1:   1  1  0  0
##  2:   2  0  1  0
##  3:   3  0  0  1
##  4:   4  0  0  0
##  5:   5  1  0  0
##  6:   6  0  1  1
##  7:   7  0  0  0
##  8:   8  1  0  0
##  9:   9  0  1  0
## 10:  10  0  0  1
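
Note that dcast returns a new table and leaves dt1 itself unchanged. If the aim is to update dt1 in place, one possible alternative (not the approach above) is to loop over the flags and assign by reference. This is only a minimal sketch: `flag_map` is a hypothetical name for the mapping from flag column to classification value, and it still makes one pass over dt2 per flag rather than a single scan.

```r
library(data.table)

dt1 <- data.table(urn = 1:10, V1 = 0, V2 = 0, V3 = 0)
dt2 <- data.table(urn = rep(1:10, 2),
                  classification = rep(1:7, length.out = 20))

# Hypothetical mapping: flag column name -> classification value
flag_map <- c(V1 = 1, V2 = 2, V3 = 6)

for (col in names(flag_map)) {
  # urns that have at least one record with this classification
  hit <- dt2[classification == flag_map[[col]], unique(urn)]
  # update dt1 by reference; rows that do not match keep their current value
  dt1[urn %in% hit, (col) := 1]
}
dt1
```

A classification that never occurs in dt2 simply yields an empty `hit`, so the corresponding flag stays at 0 and no error is raised.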

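EDIT: If a classification might be absent from dt2 (for example 8), you can wrap each column reference in tryCatch so that the missing column comes back as NA instead of raising an error: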

dcast.data.table(data = dt1[dt2,],
                 formula = urn ~ classification,
                 fun.aggregate = function(x) as.numeric(any(!is.na(x))),
                 value.var = 'classification')[, list(urn,
                                                      V1=tryCatch(`1`, error = function(e) NA),
                                                      V2=tryCatch(`2`, error = function(e) NA),
                                                      V3=tryCatch(`8`, error = function(e) NA))]
##     urn V1 V2 V3
##  1:   1  1  0 NA
##  2:   2  0  1 NA
##  3:   3  0  0 NA
##  4:   4  0  0 NA
##  5:   5  1  0 NA
##  6:   6  0  1 NA
##  7:   7  0  0 NA
##  8:   8  1  0 NA
##  9:   9  0  1 NA
## 10:  10  0  0 NA
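
If a flag for an absent classification should be 0 rather than NA, another option is to add the missing columns after casting instead of using tryCatch. The following is a rough sketch under that assumption; `flag_map`, `wide`, `wanted`, `absent` and `res` are hypothetical names introduced here, and it casts dt2 directly because only urn and classification are needed.

```r
library(data.table)

dt2 <- data.table(urn = rep(1:10, 2),
                  classification = rep(1:7, length.out = 20))

# Hypothetical mapping: flag column name -> classification value
# (8 does not occur in dt2)
flag_map <- c(V1 = 1, V2 = 2, V3 = 8)

# One row per urn, one 0/1 column per classification present in dt2
wide <- dcast(dt2, urn ~ classification,
              fun.aggregate = function(x) as.numeric(any(!is.na(x))),
              value.var = "classification")

# Add an all-zero column for each requested classification missing from dt2
wanted <- as.character(flag_map)
absent <- setdiff(wanted, names(wide))
if (length(absent) > 0) wide[, (absent) := 0]

# Keep only the requested columns and rename them to the flag names
res <- wide[, c("urn", wanted), with = FALSE]
setnames(res, wanted, names(flag_map))
res
```

Driving the column selection from a single named vector keeps the code generic: adding a flag means adding one entry to `flag_map`.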

edited Oct 1, 2014 at 8:29

answered Sep 29, 2014 at 12:02

shadow

3 Comments

Dan Over a year ago

This looks exceptionally promising, but I can't seem to get it to work with my real data. My real dt1 and dt2 have a number of other columns. Is that likely to affect how the above works? I wouldn't have expected so, given that everything appears to be named explicitly. I really appreciate the help so far. [I'll be in meetings for the next several hours, so it may take a while to respond.]

shadow Over a year ago

Can you make a reproducible example, where this does not work? Or at least print the error message you get? It's very hard to diagnose the problem otherwise.

Dan Over a year ago

Thanks @shadow, I realised the error of my ways: I was checking the contents of dt1 again after running the code, without having updated dt1. I've figured out a way around that. I do have an extension to my question, though. If I want V3 to use classification 8 instead of 6, I get an error because dt2 doesn't have any records with classification = 8. I'm hoping to make this a fairly generic piece of code that can handle a range of classifications but won't fall over if some of them aren't present. Any tips?
