Spark colocated join between two partitioned dataframes

Question

Consider the following join between two DataFrames in Spark 1.6.0:

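import org.apache.spark.sql.functions.col

// Repartition both sides into 32 partitions on the join column "a" and cache them.
val df0Rep = df0.repartition(32, col("a")).cache()
val df1Rep = df1.repartition(32, col("a")).cache()
// Join on "a", then force evaluation with an action.
val dfJoin = df0Rep.join(df1Rep, "a")
println(dfJoin.count())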

Is this join not only co-partitioned but also co-located? I know that for RDDs, if both sides use the same partitioner and are shuffled in the same operation, the join is co-located. But what about DataFrames? Thank you.

According to these two sources, co-location of two RDDs is guaranteed in this case: groups.google.com/forum/m/#!topic/spark-users/gUyCSoFo5RI and safaribooksonline.com/library/view/high-performance-spark/…
– harryNYC, Commented Mar 24, 2017 at 0:14
I think both of those links are to discussions about RDDs. It's not clear whether partitions can be guaranteed to be co-located in the same way for DataFrames/Datasets. I am interested in a more definitive answer.
– Frank Wilson, Commented Feb 22, 2019 at 19:36

Anup Thomas · Accepted Answer · 2020-03-31 13:13:29Z

[https://medium.com/@achilleus/https-medium-com-joins-in-apache-spark-part-3-1d40c1e51e1c]

According to the article linked above, sort-merge join is the default join strategy. I would like to add an important point:

For ideal performance of a sort-merge join, all rows with the same value of the join key must be available in the same partition. This is what warrants the infamous partition exchange (shuffle) between executors. Co-located partitions can avoid this unnecessary data shuffle. The data also needs to be evenly distributed on the join keys, and the join keys need to be unique enough that rows can be spread equally across the cluster to achieve maximum parallelism from the available partitions.
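
To make the point about matching partitions concrete, here is a minimal plain-Scala sketch (no Spark involved). It assumes a simple hash-modulo placement rule and integer join keys; the names `CoPartitionSketch`, `partitionOf`, `partitionByKey`, `sortMergeJoin` and `coPartitionedJoin` are made up for illustration and are not Spark APIs. It only models co-partitioning, that is, equal keys landing in the same partition index on both sides; it says nothing about which executor holds a partition.

```scala
object CoPartitionSketch {
  // Hash-modulo placement (an assumed rule for this sketch): for a fixed
  // partition count, equal keys always map to the same partition index.
  def partitionOf(key: Int, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Stand-in for the shuffle: split rows into numPartitions buckets by key.
  def partitionByKey[V](rows: Seq[(Int, V)], numPartitions: Int): Vector[Seq[(Int, V)]] = {
    val grouped = rows.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty))
  }

  // Sort-merge join of two partitions: sort both sides on the key, then
  // advance two cursors and emit the cross product of each run of equal keys.
  def sortMergeJoin[A, B](left: Seq[(Int, A)], right: Seq[(Int, B)]): Seq[(Int, A, B)] = {
    val l = left.sortBy(_._1).toVector
    val r = right.sortBy(_._1).toVector
    val out = Vector.newBuilder[(Int, A, B)]
    var i = 0
    var j = 0
    while (i < l.length && j < r.length) {
      val lk = l(i)._1
      val rk = r(j)._1
      if (lk < rk) i += 1
      else if (lk > rk) j += 1
      else {
        // End (exclusive) of the run of rows sharing this key on each side.
        val iEnd = l.indexWhere(_._1 != lk, i) match { case -1 => l.length; case n => n }
        val jEnd = r.indexWhere(_._1 != rk, j) match { case -1 => r.length; case n => n }
        for (x <- i until iEnd; y <- j until jEnd) out += ((lk, l(x)._2, r(y)._2))
        i = iEnd
        j = jEnd
      }
    }
    out.result()
  }

  // Because both sides use the same placement rule and partition count,
  // partition i of the left only ever needs partition i of the right.
  def coPartitionedJoin[A, B](left: Seq[(Int, A)], right: Seq[(Int, B)],
                              numPartitions: Int): Seq[(Int, A, B)] = {
    val lp = partitionByKey(left, numPartitions)
    val rp = partitionByKey(right, numPartitions)
    lp.zip(rp).flatMap { case (a, b) => sortMergeJoin(a, b) }
  }
}
```

For example, `CoPartitionSketch.coPartitionedJoin(Seq(1 -> "a", 33 -> "c"), Seq(1 -> "y", 33 -> "x"), 32)` joins keys 1 and 33 inside the same partition, since both hash to index 1. Whether Spark's planner actually skips the exchange for a given DataFrame join is a separate question; one way to check is to call `dfJoin.explain()` and look for Exchange operators in the physical plan.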

1 Comment

Madhava Carrillo Over a year ago

Updated link medium.com/@achilleus/…
