Is it possible to do the equivalent of merge(..., all = TRUE) with the data.table join syntax (like X[Y])?

Specifically, I would need a very fast way of getting the result of:

library(data.table)
item_length = data.table(index = 1:10, length = c(2,5,4,6,3), key = "index")
item_weigth = data.table(index = c(2,4,6,7,8,11), weight = c(.3,.5,.2), key = "index")
merge(item_length, item_weigth, all = TRUE)

Which gives:

> merge(item_length ,item_weigth , all=TRUE)
      index length weight
[1,]     1      2     NA
[2,]     2      5    0.3
[3,]     3      4     NA
[4,]     4      6    0.5
[5,]     5      3     NA
[6,]     6      2    0.2
[7,]     7      5    0.3
[8,]     8      4    0.5
[9,]     9      6     NA
[10,]    10      3     NA
[11,]    11     NA    0.2
  • merge.data.table should be pretty fast. Can you provide some timings? We have improved its speed in recent versions. Which version of data.table are you using? Commented Jul 11, 2012 at 16:57
  • Ok, I've updated to the latest version 1.8.0 and it is actually extremely fast! Thanks a lot! Commented Jul 11, 2012 at 17:16

2 Answers

Sorry for answering my own question, but I think this is worth sharing:

A very fast solution is simply to update to the latest version of data.table (1.8.0) and keep using merge(). (Thank you so much, Matthew!)

Here is my test data and benchmark results:

With data.table:

library(data.table)

full_index <- 1:5000000
ratio_in_samples <- 0.8
x <- data.table(index = sample(full_index, length(full_index)*ratio_in_samples), 
                var1 = rnorm(length(full_index)*ratio_in_samples),
                key = "index")

y <- data.table(index = sample(full_index, length(full_index)*ratio_in_samples), 
                var2 = rnorm(length(full_index)*ratio_in_samples),
                key = "index")

system.time(
  result <- merge(x, y, all = TRUE)
)

Time with data.table:

user  system elapsed 
5.05    0.55    5.62

Whereas with data.frame:

full_index <- 1:5000000
ratio_in_samples <- 0.8
x <- data.frame(index = sample(full_index, length(full_index)*ratio_in_samples), 
                var1 = rnorm(length(full_index)*ratio_in_samples))

y <- data.frame(index = sample(full_index, length(full_index)*ratio_in_samples), 
                var2 = rnorm(length(full_index)*ratio_in_samples))

system.time(
  result <- merge(x,y, all=TRUE)
)

Time with data.frame:

user  system elapsed 
78.83    1.75   80.67 
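
The question asked specifically about the X[Y] syntax, so here is a minimal sketch of a full outer join written that way, checked against merge(). It assumes key values are unique within each table and that both keys are integers. The `L` suffixes and the explicit `rep()` calls are my adjustments to the question's toy data (they avoid relying on recycling and on type coercion of the key), and `item_weight`, `all_index`, `via_join` and `via_merge` are illustrative names, not anything from data.table itself.

```r
library(data.table)

# Toy data from the question, with recycling written out and integer keys
item_length <- data.table(index = 1:10,
                          length = rep(c(2, 5, 4, 6, 3), 2),
                          key = "index")
item_weight <- data.table(index = c(2L, 4L, 6L, 7L, 8L, 11L),
                          weight = rep(c(.3, .5, .2), 2),
                          key = "index")

# Every key value that appears in either table
all_index <- data.table(index = sort(union(item_length$index, item_weight$index)))

# Look up each key in item_length, then look up the result in item_weight.
# X[Y] keeps every row of Y, so keys with no match get NA in X's columns.
via_join <- item_weight[item_length[all_index, on = "index"], on = "index"]
setcolorder(via_join, c("index", "length", "weight"))
setkey(via_join, index)

# Reference result for comparison
via_merge <- merge(item_length, item_weight, all = TRUE)

print(via_join)
```

With duplicated keys this chained lookup would not necessarily match merge(), so for general use merge(x, y, all = TRUE) remains the simpler choice.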

2 Comments

Can it be done even faster?
Yes, see this answer: stackoverflow.com/a/77324984/13460602

collapse has a very fast join function. With the 5M-row data provided by @nassimhddd, collapse is about 2.5x faster than data.table (median of roughly 1.01 s versus 2.54 s over 5 runs):

Benchmark:

# Unit: milliseconds
#      expr       min       lq     mean   median       uq      max
#        dt 1745.0900 2414.042 2598.093 2543.145 2837.262 3450.925
#  collapse  937.0518 1006.102 1076.427 1014.616 1071.268 1353.097

Code

library(microbenchmark)
library(collapse)
library(data.table)
microbenchmark(
  dt = merge(x, y, all = TRUE),
  collapse = join(x, y, how = "full"),
  times = 5L
)

2 Comments

‘collapse’ looks like a very interesting package. Shame the API is so atrocious.
@KonradRudolph Yes, I'm really a fan of it. I don't think the API is so bad, e.g. compared to data.table
