
I'm trying to merge two data.tables, both of which are around 60-80 million rows long. I know that data.table is already built to be very adept at merging, but I'm wondering whether, for data of this size, data.table is still more efficient than parallelizing the merge, especially since I have access to a computing cluster.

This is what I'm currently doing.

setorder(fcc_temp, BlockCode)
setorder(block_data_long, block_fips)
fcc_temp[block_data_long, c("pop", "tract") := list(pop, tract),
           on = c(BlockCode = "block_fips", year = "year")]
  • Maybe the only way to find out is to time it for 10m rows? Commented May 6, 2021 at 21:37
  • setkey is also appreciated. Commented May 7, 2021 at 0:34
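Following the first comment, one way to decide is simply to time the update-join on a sample before scaling to the full 60-80 million rows. A minimal sketch, with made-up stand-ins for fcc_temp and block_data_long (substitute your real tables and drop the fake-data block):

```r
library(data.table)

# Hypothetical sample data, only to make the sketch self-contained;
# both join keys are permutations of 1:n so every row matches.
n <- 1e7
fcc_temp <- data.table(BlockCode = sample.int(n), year = 2020L)
block_data_long <- data.table(block_fips = sample.int(n), year = 2020L,
                              pop = runif(n),
                              tract = sample.int(1e5, n, replace = TRUE))

# Time the update-join on the 10M-row sample; i.pop / i.tract make it
# explicit that the columns come from block_data_long (the i table).
system.time(
  fcc_temp[block_data_long, c("pop", "tract") := list(i.pop, i.tract),
           on = c(BlockCode = "block_fips", year = "year")]
)
```

If the sample finishes in seconds, the full-size join is unlikely to justify the complexity of a hand-rolled parallel merge.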

1 Answer


Your example doesn't show many details, such as the data types.

The data.table join is currently single-threaded. Some small parts of it use multiple cores, but AFAIR that is only finding the order of the join columns; computing the matches runs in a single thread.
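As a quick sanity check, you can query and set how many threads data.table is allowed to use for the parts that are multi-threaded (such as ordering):

```r
library(data.table)

# Report the number of threads data.table will use (verbose prints details)
getDTthreads(verbose = TRUE)

# threads = 0 asks data.table to use all logical CPUs it is permitted to use
setDTthreads(0L)
```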

Keep in mind that parallelizing a join is non-trivial and does not scale as well as many other operations, so the potential gains are much smaller than for, say, grouping.

Anyway, this computing of matches is still super fast. We run a benchmark comparing joins; one of the questions (question 5) is a "big to big join", which seems to correspond to your scenario: https://h2oai.github.io/db-benchmark/. Below is the 100M data size for the join task; Q5 is a join of a 100M-row LHS to a 100M-row RHS. [benchmark chart: join task timings at 100M rows]

You can see that data.table is pretty much at the top. Note that the benchmark joins on a single integer column, so there is likely to be a difference to your scenario, where you join on two columns. The benchmark also does not take into account the possibility of pre-sorting the data. Try using setkey (instead of setorder) on your tables to sort them by the join columns. To be fair, I believe this kind of setup might not be easy to beat.
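A minimal sketch of the setkey approach, using the column names from the question (the data here is made up purely for illustration):

```r
library(data.table)

# Fabricated example tables; replace with the real fcc_temp and
# block_data_long. Both keys are permutations of 1:n so every row matches.
n <- 1e6
fcc_temp <- data.table(BlockCode = sample.int(n), year = 2019L)
block_data_long <- data.table(block_fips = sample.int(n), year = 2019L,
                              pop = runif(n),
                              tract = sample.int(1e4, n, replace = TRUE))

# setkey sorts each table by reference AND marks it as keyed, so data.table
# knows the sort order is established before the join.
setkey(fcc_temp, BlockCode, year)
setkey(block_data_long, block_fips, year)

# Update-join as in the question; adding columns by reference with :=
# preserves the key because the row order is unchanged.
fcc_temp[block_data_long, c("pop", "tract") := list(i.pop, i.tract),
         on = c(BlockCode = "block_fips", year = "year")]
```

The difference from setorder is that setorder only physically sorts the rows, while setkey additionally records the key on the table.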

In a future version, computing the matches of a join will be parallelized as well; drafts of that are already in the repository.


2 Comments

  • Unbelievable! data.table is even faster than Python's pandas?
  • pandas is known to be not very competitive here, so I am not sure where that impression comes from.
