2

I'm attempting to update a data.table based on values in another table. I thought I had an approach (although not particularly efficient given my actual dt2 has 3.5M records) but it turns out something is going wrong with my code

In DT1 I have a number of variables (numeric) that I am wanting to set to 1 or 0 depending on what their classification value is.

Example of data:

dt1 <- data.table(urn = 1:10, V1=0,V2=0,V3=0)
##    urn V1 V2 V3
## 1:   1  0  0  0
## 2:   2  0  0  0
## 3:   3  0  0  0
## 4:   4  0  0  0
## 5:   5  0  0  0
## 6:   6  0  0  0
## 7:   7  0  0  0
## 8:   8  0  0  0
## 9:   9  0  0  0
##10:  10  0  0  0

dt2 <- data.table(urn=rep(1:10,2),classification=0)
dt2$classification <- 1:7 #does give a warning message
##    urn classification
## 1:   1              1
## 2:   2              2
## 3:   3              3
## 4:   4              4
## 5:   5              5
## 6:   6              6
## 7:   7              7
## 8:   8              1
## 9:   9              2
##10:  10              3
##11:   1              4
##12:   2              5
##13:   3              6
##14:   4              7
##15:   5              1
##16:   6              2
##17:   7              3
##18:   8              4
##19:   9              5
##20:  10              6

What I want to happen is where the urn matches, and the urn has a record with classification of 1, then set the V1 flag to 1; where the urn has a classification of 2, set V2 = 1; you get the idea.

My current approach has been:

setkey(dt1, urn)
setkey(dt2, urn)
dt1[dt2, V1:= ifelse(i.classification == 1 , 1, dt1$V1)]
dt1[dt2, V2:= ifelse(i.classification == 2, 1, dt1$V2)]
dt1[dt2, V3:= ifelse(i.classification == 6, 1, dt1$V3)]
dt1
##    urn V1 V2 V3
## 1:   1  1  0  0
## 2:   2  1  0  0
## 3:   3  1  0  1
## 4:   4  1  0  0
## 5:   5  1  0  0
## 6:   6  1  1  0
## 7:   7  1  0  0
## 8:   8  1  0  0
## 9:   9  1  0  0
##10:  10  1  0  1

What I actually expect the output to look like is:

##    urn V1 V2 V3
## 1:   1  1  0  0
## 2:   2  0  1  0
## 3:   3  0  0  1
## 4:   4  0  0  0
## 5:   5  1  0  0
## 6:   6  0  1  1
## 7:   7  0  0  0
## 8:   8  1  0  0
## 9:   9  0  1  0
##10:  10  0  0  1

Any help would be greatly appreciated. Particularly if I am able to simplify the command to manage all of these in a single scan rather than 3 joins of data. As I said 3.5M records adds a bit of overhead (particularly when I'll actually be setting flags for about 10 columns).

Thanks in advance.

1
  • Something like dcast(dt1[dt2,], urn~classification) might work Commented Sep 29, 2014 at 10:35

1 Answer 1

2

You can use dcast.data.table for this.

dcast.data.table(data = dt1[dt2,], 
      formula = urn ~ classification, 
      fun.aggregate = function(x) as.numeric(any(!is.na(x))), 
      value.var = 'classification')[, list(urn, V1=`1`, V2=`2`, V3=`6`)] 
##     urn V1 V2 V3
##  1:   1  1  0  0
##  2:   2  0  1  0
##  3:   3  0  0  1
##  4:   4  0  0  0
##  5:   5  1  0  0
##  6:   6  0  1  1
##  7:   7  0  0  0
##  8:   8  1  0  0
##  9:   9  0  1  0
## 10:  10  0  0  1

EDIT: You can wrap the last statement in tryCatch functions if necessary:

dcast.data.table(data = dt1[dt2,], 
                 formula = urn ~ classification, 
                 fun.aggregate = function(x) as.numeric(any(!is.na(x))), 
                 value.var = 'classification')[, list(urn, 
                                                      V1=tryCatch(`1`, error =     ## function(e) NA), 
                                                      V2=tryCatch(`2`, error = function(e) NA), 
                                                      V3=tryCatch(`8`, error = function(e) NA))] 
##     urn V1 V2 V3
##  1:   1  1  0 NA
##  2:   2  0  1 NA
##  3:   3  0  0 NA
##  4:   4  0  0 NA
##  5:   5  1  0 NA
##  6:   6  0  1 NA
##  7:   7  0  0 NA
##  8:   8  1  0 NA
##  9:   9  0  1 NA
## 10:  10  0  0 NA
Sign up to request clarification or add additional context in comments.

3 Comments

This looks exceptionally promising, but I can't seem to get it to work with my real data. I do have a number of other columns in dt1 & dt2 in my real world data. Is this likely to have a negative impact on how the above works, I wouldn't have suspected so given that everything appears to be named in the above... really appreciate the help so far. [I'll be in meetings for the next several hours so it may take a while to respond]
Can you make a reproducible example, where this does not work? Or at least print the error message you get? It's very hard to diagnose the problem otherwise.
Thanks @shadow, I realised the error of my ways. I was checking the contents of dt1 again after I had run the code without having updated dt1. I've figured a way around that. Thanks I do have another extension of my question though, If instead of the value of 6 for V3 I wanted to use 8 for V3, I get an error because dt2 doesn't have any records with classification=8. I'm hoping to be able to make this a fairly generic piece of code, that would be able to account for the presence of a range of classifications, but won't fall over if those classifications aren't there. Any tips?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.