
Is it possible to combine chaining and assignment by reference in a data.table?

For example, I would like to do this:

DT[a == 1][b == 0, c := 2]

However, this leaves the original table unchanged: DT[a == 1] creates a temporary copy, and it is that copy which is then modified and returned.
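A minimal sketch of that behaviour (using a small made-up table, assuming the data.table package is loaded):

library(data.table)
DT <- data.table(a = c(1, 1, 2), b = c(0, 1, 0))  ## small mock table for illustration
DT[a == 1][b == 0, c := 2]  ## `:=` updates the temporary copy returned by DT[a == 1]
DT                          ## the original DT still has no column `c`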

I would rather not do

DT[a == 1 & b == 0, c := 2]

as this is very slow and I would also rather avoid

 DT <- DT[a == 1][b == 0, c := 2]

as I would prefer to do the assignment by reference. This question came up as part of [1], where it was left unanswered.

[1] Conditional binary join and update by reference using the data.table package

  • I don't see how this question is related to the other one, and you need to show the context in which DT[a == 1 & b == 0] is "very slow". If that's the part that's slow for you, more than likely you're doing something else wrong. Commented Apr 23, 2015 at 2:10

1 Answer


I'm not sure why you think that, even if DT[a == 1][b == 0, c := 2] worked as intended, it would be more efficient than DT[a == 1 & b == 0, c := 2].

Either way, the most efficient solution in your case would be to key by both a and b and perform the assignment by reference during a binary join on both:

DT <- data.table(a = c(1, 1, 1, 2, 2), b = c(0, 2, 0, 1, 1)) ## mock data
setkey(DT, a, b) ## keying by both `a` and `b`
DT[J(1, 0), c := 2] ## Update `c` by reference
DT
#    a b  c
# 1: 1 0  2
# 2: 1 0  2
# 3: 1 2 NA
# 4: 2 1 NA
# 5: 2 1 NA
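
If keying the table is undesirable, a rough equivalent (assuming a data.table version that supports the on= argument, 1.9.6+) is an ad hoc join:

DT[.(1, 0), on = c("a", "b"), c := 2]  ## same update by reference, without setkey()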

Comments

Thanks, can this also be extended to arbitrary conditions, say a %in% set and b < constant?
No. What's wrong with just doing DT[a %in% 1:2 & b == 0, c := 2], for example (sketched below)? It should be very efficient, as data.table sets secondary keys (if you didn't). But in this case you have already keyed the data.
I strongly dislike answers that recommend keying to "speed up" a simple look-up. The vast majority of the time that's exactly the wrong thing to do and will result in a slowdown, because users take an already bad thing they did (running a non-vectorized loop) and then make it worse by adding an extra sort.
Computing the order isn't that expensive. I've done extensive tests on data sets with up to a billion rows (and several columns). It can be slower than a normal vector scan on really large data (>10 or 100 million rows) on the first run, but not by much (a few seconds). It's the re-ordering that is expensive; hence setkey() wouldn't gain much for normal operations. Joins on large data benefit from cache efficiency due to sorted data, i.e., on data large enough that cache inefficiency trumps reordering time. [A nice way to compare this is with dplyr on really large data.]
Auto indexing is just the beginning of secondary keys. They'll be extended to joins, rolling values on ordinary subsets, etc. You'll still always be able to setkey() if you believe your data is so large that cache inefficiency trumps reordering, but the advantage of secondary keys is that we know the exact indices of each value you'd like to subset/join, in addition to not having to reorder the data. It's a great compromise between speed and functionality in most cases.
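
For the "arbitrary conditions" follow-up above, a hedged sketch of the vector-scan form suggested in the comments (the set and the threshold are made up for illustration):

DT[a %in% c(1, 2) & b < 1, c := 2]  ## ordinary logical subset; `:=` still updates DT by reference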
