fast merge(..., all = TRUE) with data.table in R

Question

Is it possible to do the equivalent of merge(..., all = TRUE) using the data.table syntax (like X[Y])?

Specifically, I would need a very fast way of getting the result of:

item_length = data.table(index = 1:10, length = c(2,5,4,6,3), key = "index")
item_weigth = data.table(index = c(2,4,6,7,8,11), weight = c(.3,.5,.2), key = "index")
merge(item_length, item_weigth, all = TRUE)

Which gives:

> merge(item_length ,item_weigth , all=TRUE)
      index length weight
[1,]     1      2     NA
[2,]     2      5    0.3
[3,]     3      4     NA
[4,]     4      6    0.5
[5,]     5      3     NA
[6,]     6      2    0.2
[7,]     7      5    0.3
[8,]     8      4    0.5
[9,]     9      6     NA
[10,]    10      3     NA
[11,]    11     NA    0.2

merge.data.table should be pretty fast. Can you provide some timings? We have improved its speed in recent versions. Which version of data.table are you using?
– Matt Dowle, Commented Jul 11, 2012 at 16:57
Ok, I've updated to the latest version 1.8.0 and it is actually extremely fast! Thanks a lot!
– nassimhddd, Commented Jul 11, 2012 at 17:16

nassimhddd · Accepted Answer · 2012-07-11 17:22:09Z

18

Sorry for answering my own question, but I think this is worth sharing:

A very fast solution seems to be updating to the latest version of data.table (1.8.0). (Thank you so much, Matthew!)

Here is my test data and benchmark results:

With data.table:

library(data.table)

full_index <- 1:5000000
ratio_in_samples <- 0.8
x <- data.table(index = sample(full_index, length(full_index)*ratio_in_samples), 
                var1 = rnorm(length(full_index)*ratio_in_samples),
                key = "index")

y <- data.table(index = sample(full_index, length(full_index)*ratio_in_samples), 
                var2 = rnorm(length(full_index)*ratio_in_samples),
                key = "index")

system.time(
  result <- merge(x, y, all = TRUE)
)

Time with data.table:

user  system elapsed 
5.05    0.55    5.62

Whereas with data.frame:

full_index <- 1:5000000
ratio_in_samples <- 0.8
x <- data.frame(index = sample(full_index, length(full_index)*ratio_in_samples), 
                var1 = rnorm(length(full_index)*ratio_in_samples))

y <- data.frame(index = sample(full_index, length(full_index)*ratio_in_samples), 
                var2 = rnorm(length(full_index)*ratio_in_samples))

system.time(
  result <- merge(x,y, all=TRUE)
)

Time with data.frame:

user  system elapsed 
78.83    1.75   80.67
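
For the X[Y] form the question asks about, one possible sketch is to build the union of the keys and look both tables up against it. This is only an illustration on the small tables from the question, not a claim about how merge() works internally: the helper names all_keys and full_xy are made up here, the recycled values are written out in full, and both index columns are made integer so the join types match.

```r
library(data.table)

# Small tables from the question, with values written out rather than recycled.
item_length <- data.table(index = 1:10,
                          length = rep(c(2, 5, 4, 6, 3), times = 2),
                          key = "index")
item_weigth <- data.table(index = c(2L, 4L, 6L, 7L, 8L, 11L),
                          weight = rep(c(.3, .5, .2), times = 2),
                          key = "index")

# Reference result: full outer join via merge().
merged <- merge(item_length, item_weigth, all = TRUE)

# Bracket-syntax sketch: every key that appears in either table (hypothetical helper).
all_keys <- data.table(index = sort(unique(c(item_length$index, item_weigth$index))))

# Look up lengths for all keys, then look up weights for that result.
full_xy <- item_weigth[item_length[all_keys, on = "index"], on = "index"]
setcolorder(full_xy, c("index", "length", "weight"))

# Compare values only, ignoring attributes such as the key.
all.equal(merged, full_xy, check.attributes = FALSE)
```

Whether this is faster than merge() on large data would need its own benchmark; the timings in this answer only cover merge().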

answered Jul 11, 2012 at 17:22

2 Comments

eod Over a year ago

Can it be done even faster?

Maël Over a year ago

It can, see this answer: stackoverflow.com/a/77324984/13460602

Maël · Accepted Answer · 2023-10-19 15:19:15Z

2

collapse has a very fast join function. With the test data provided by @nassimhddd (built from a 5M-element index), collapse is about 2.5x faster than data.table by median time:

Benchmark:

# Unit: milliseconds
#      expr       min       lq     mean   median       uq      max
#        dt 1745.0900 2414.042 2598.093 2543.145 2837.262 3450.925
#  collapse  937.0518 1006.102 1076.427 1014.616 1071.268 1353.097

Code

library(microbenchmark)
library(collapse)
library(data.table)
microbenchmark(
  dt = merge(x, y, all = TRUE),
  collapse = join(x, y, how = "full"),
  times = 5L
)
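
Before comparing speed, it may be worth checking that both calls return the same rows. The sketch below does that on a much smaller data set; the seed, the size n, and the names dt_res and cl_res are assumptions made for illustration, and it assumes a collapse version that provides join() (2.0.0 or later).

```r
library(data.table)

set.seed(1)   # arbitrary seed, only for reproducibility
n <- 1000L    # far smaller than the 5M index used in the benchmark
x <- data.table(index = sample(n, 800L), var1 = rnorm(800L))
y <- data.table(index = sample(n, 800L), var2 = rnorm(800L))

# Full outer join with each package, converted to plain data.frames.
dt_res <- as.data.frame(merge(x, y, by = "index", all = TRUE))
cl_res <- as.data.frame(
  collapse::join(x, y, on = "index", how = "full", verbose = 0)
)

# Put both results in index order before comparing them.
dt_res <- dt_res[order(dt_res$index), ]
cl_res <- cl_res[order(cl_res$index), ]

nrow(dt_res) == nrow(cl_res)
all.equal(dt_res$var1, cl_res$var1)
all.equal(dt_res$var2, cl_res$var2)
```

Row order is the main difference to expect: the results are sorted explicitly here because the two functions are not guaranteed to return rows in the same order.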

edited Oct 19, 2023 at 15:19

answered Oct 19, 2023 at 15:06

2 Comments

Konrad Rudolph Over a year ago

‘collapse’ looks like a very interesting package. Shame the API is so atrocious.

Maël Over a year ago

@KonradRudolph Yes, I'm really a fan of it. I don't think the API is so bad, e.g. compared to data.table
