
I'm trying to merge two data.tables, both of which are around 60-80 million rows long. I know that data.table is already built to be very adept at merging, but I'm wondering whether, for data of this size, data.table is still more efficient than parallelizing the merge, especially since I have access to a computing cluster.

This is what I'm currently doing.

setorder(fcc_temp, BlockCode)
setorder(block_data_long, block_fips)
fcc_temp[block_data_long, c("pop", "tract") := list(pop, tract),
           on = c(BlockCode = "block_fips", year = "year")]
  • Maybe the only way to find out is to time it for 10m rows? Commented May 6, 2021 at 21:37
  • setkey is also appreciated. Commented May 7, 2021 at 0:34
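Following the first comment, one way to decide is simply to time the update-join on a sample before scaling to the full 60-80 million rows. A minimal sketch, with made-up stand-ins for fcc_temp and block_data_long (substitute your real tables and drop the fake-data block):

```r
library(data.table)

# Hypothetical sample data, only to make the sketch self-contained;
# both join keys are permutations of 1:n so every row matches.
n <- 1e7
fcc_temp <- data.table(BlockCode = sample.int(n), year = 2020L)
block_data_long <- data.table(block_fips = sample.int(n), year = 2020L,
                              pop = runif(n),
                              tract = sample.int(1e5, n, replace = TRUE))

# Time the update-join on the 10M-row sample; i.pop / i.tract make it
# explicit that the columns come from block_data_long (the i table).
system.time(
  fcc_temp[block_data_long, c("pop", "tract") := list(i.pop, i.tract),
           on = c(BlockCode = "block_fips", year = "year")]
)
```

If the sample finishes in seconds, the full-size join is unlikely to justify the complexity of a hand-rolled parallel merge.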

1 Answer


Your example doesn't show many details, such as the data types.

The data.table join is currently single-threaded. Some small parts of it use multiple cores, but AFAIR that is only finding the order of the join columns; computing the matches runs in a single thread.
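As a quick sanity check, you can query and set how many threads data.table is allowed to use for the parts that are multi-threaded (such as ordering):

```r
library(data.table)

# Report the number of threads data.table will use (verbose prints details)
getDTthreads(verbose = TRUE)

# threads = 0 asks data.table to use all logical CPUs it is permitted to use
setDTthreads(0L)
```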

Keep in mind that parallelizing a join is non-trivial and does not scale as well as many other operations, so the potential gains are much smaller than for, say, grouping.

Anyway, this computing of matches is still super fast. We run a benchmark comparing joins; one of the questions (question 5) is a "big to big join", which seems to correspond to your scenario: https://h2oai.github.io/db-benchmark/. Below is the 100M data size for the join task; Q5 is a join of a 100M-row LHS to a 100M-row RHS. [benchmark chart: join task timings at 100M rows]

You can see that data.table is pretty much at the top. Note that the benchmark joins on a single integer column, so there is likely to be a difference to your scenario, where you join on two columns. The benchmark also does not take into account the possibility of pre-sorting the data. Try using setkey (instead of setorder) on your tables to sort them by the join columns. To be fair, I believe this kind of setup might not be easy to beat.
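A minimal sketch of the setkey approach, using the column names from the question (the data here is made up purely for illustration):

```r
library(data.table)

# Fabricated example tables; replace with the real fcc_temp and
# block_data_long. Both keys are permutations of 1:n so every row matches.
n <- 1e6
fcc_temp <- data.table(BlockCode = sample.int(n), year = 2019L)
block_data_long <- data.table(block_fips = sample.int(n), year = 2019L,
                              pop = runif(n),
                              tract = sample.int(1e4, n, replace = TRUE))

# setkey sorts each table by reference AND marks it as keyed, so data.table
# knows the sort order is established before the join.
setkey(fcc_temp, BlockCode, year)
setkey(block_data_long, block_fips, year)

# Update-join as in the question; adding columns by reference with :=
# preserves the key because the row order is unchanged.
fcc_temp[block_data_long, c("pop", "tract") := list(i.pop, i.tract),
         on = c(BlockCode = "block_fips", year = "year")]
```

The difference from setorder is that setorder only physically sorts the rows, while setkey additionally records the key on the table.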

In a future version, computing the matches of a join will be parallelized as well; drafts of that are already in the repository.


2 Comments

  • Unbelievable! data.table is even faster than Python's pandas?
  • pandas is known to be not very competitive here, so I am not sure where that impression comes from.
