
I have a large dataframe which I created with 800 partitions.

df.rdd.getNumPartitions()
800

When I use dropDuplicates on the dataframe, it changes the number of partitions to the default of 200:

df = df.dropDuplicates()
df.rdd.getNumPartitions()
200

This behaviour causes problems for me, as it leads to out-of-memory errors.

Do you have any suggestions for fixing this problem? I tried setting spark.sql.shuffle.partition to 800, but it doesn't work. Thanks


2 Answers


This happens because dropDuplicates requires a shuffle. If you want a specific number of partitions, you should set spark.sql.shuffle.partitions (its default value is 200):

df = sc.parallelize([("a", 1)]).toDF()
df.rdd.getNumPartitions()
## 8

df.dropDuplicates().rdd.getNumPartitions()
## 200

sqlContext.setConf("spark.sql.shuffle.partitions", "800")

df.dropDuplicates().rdd.getNumPartitions()
## 800
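
If you are on Spark 2.x and have a SparkSession (assumed to be named spark below), the same setting can also be changed through spark.conf instead of the SQLContext:

spark.conf.set("spark.sql.shuffle.partitions", "800")

df.dropDuplicates().rdd.getNumPartitions()
## 800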

An alternative approach (Spark 1.6+) is to repartition first:

df.repartition(801, *df.columns).dropDuplicates().rdd.getNumPartitions()
## 801

It is slightly more flexible but less efficient, because it doesn't perform map-side (local) aggregation. Since the data is already hash-partitioned by all columns, the subsequent dropDuplicates doesn't need another shuffle, so the 801 partitions are preserved.


1 Comment

Thank you. I realised my mistake was missing the last character 's' in spark.sql.shuffle.partition.

I found the solution at Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

Use reduceByKey instead of dropDuplicates. reduceByKey also has an option to specify the number of partitions for the final RDD.

The downside of using reduceByKey in this case is that it is slow.
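
A rough sketch of that approach (the column names below are made up, and spark is assumed to be a SparkSession; sqlContext.createDataFrame works the same way in older versions): each row is keyed by its values, reduceByKey keeps one row per key, and the number of output partitions is passed directly.

df = spark.createDataFrame([("a", 1), ("a", 1), ("b", 2)], ["k", "v"])

deduped_rdd = (df.rdd
    .map(lambda row: (tuple(row), row))              # key each row by all of its columns
    .reduceByKey(lambda a, b: a, numPartitions=800)  # keep one row per key, 800 partitions
    .values())

deduped = spark.createDataFrame(deduped_rdd, df.schema)

deduped.rdd.getNumPartitions()
## 800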

