
Is it possible to combine chaining and assignment by reference in a data.table?

For example, I would like to do this:

DT[a == 1][b == 0, c := 2]

However, this leaves the original table unchanged: DT[a == 1] creates a temporary copy, and it is that copy which is then modified and returned.
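A minimal sketch of that behaviour (using a small made-up table, assuming the data.table package is loaded):

library(data.table)
DT <- data.table(a = c(1, 1, 2), b = c(0, 1, 0))  ## small mock table for illustration
DT[a == 1][b == 0, c := 2]  ## `:=` updates the temporary copy returned by DT[a == 1]
DT                          ## the original DT still has no column `c`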

I would rather not do

DT[a == 1 & b == 0, c := 2]

as this is very slow and I would also rather avoid

 DT <- DT[a == 1][b == 0, c := 2]

as I would prefer to do the assignment by reference. This question came up as part of [1], where it was left unanswered.

[1] Conditional binary join and update by reference using the data.table package

  • I don't see how this question is related to the other one, and you need to show the context in which DT[a == 1 & b == 0] is "very slow". If that's the part that's slow for you, more than likely you're doing something else wrong. Commented Apr 23, 2015 at 2:10

1 Answer


I'm not sure why you think that, even if DT[a == 1][b == 0, c := 2] worked as intended, it would be more efficient than DT[a == 1 & b == 0, c := 2].

Either way, the most efficient solution in your case would be to key by both a and b and perform the assignment by reference during a binary join on both:

DT <- data.table(a = c(1, 1, 1, 2, 2), b = c(0, 2, 0, 1, 1)) ## mock data
setkey(DT, a, b) ## keying by both `a` and `b`
DT[J(1, 0), c := 2] ## Update `c` by reference
DT
#    a b  c
# 1: 1 0  2
# 2: 1 0  2
# 3: 1 2 NA
# 4: 2 1 NA
# 5: 2 1 NA
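
If keying the table is undesirable, a rough equivalent (assuming a data.table version that supports the on= argument, 1.9.6+) is an ad hoc join:

DT[.(1, 0), on = c("a", "b"), c := 2]  ## same update by reference, without setkey()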

Comments

Thanks, can this also be extended to arbitrary conditions, say a %in% set and b < constant?
No. What's wrong with just doing DT[a %in% 1:2 & b == 0, c := 2], for example (sketched below)? It should be very efficient, as data.table sets secondary keys (if you didn't). But in this case you have already keyed the data.
I strongly dislike answers that recommend keying to "speed up" a simple look-up. The vast majority of the time that's exactly the wrong thing to do and will result in a slowdown, because users take an already bad thing they did (running a non-vectorized loop) and then make it worse by adding an extra sort.
Computing the order isn't that expensive. I've done extensive tests on data sets with up to a billion rows (and several columns). It can be slower than a normal vector scan on really large data (>10 or 100 million rows) on the first run, but not by much (a few seconds). It's the re-ordering that is expensive; hence setkey() wouldn't gain much for normal operations. Joins on large data benefit from cache efficiency due to sorted data, i.e., on data large enough that cache inefficiency trumps reordering time. [A nice way to compare this is with dplyr on really large data.]
Auto indexing is just the beginning of secondary keys. They'll be extended to joins, rolling values on ordinary subsets, etc. You'll still always be able to setkey() if you believe your data is so large that cache inefficiency trumps reordering, but the advantage of secondary keys is that we know the exact indices of each value you'd like to subset/join, in addition to not having to reorder the data. It's a great compromise between speed and functionality in most cases.
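
For the "arbitrary conditions" follow-up above, a hedged sketch of the vector-scan form suggested in the comments (the set and the threshold are made up for illustration):

DT[a %in% c(1, 2) & b < 1, c := 2]  ## ordinary logical subset; `:=` still updates DT by reference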
