I have a large dataframe which I created with 800 partitions.
df.rdd.getNumPartitions()
800
When I call dropDuplicates on the DataFrame, the shuffle it triggers resets the partition count to the default of 200:
df = df.dropDuplicates()
df.rdd.getNumPartitions()
200
This behaviour causes problems for me, as it leads to out-of-memory errors.
Do you have any suggestions for fixing this? I tried setting spark.sql.shuffle.partitions to 800, but it doesn't work. Thanks
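For reference, a minimal sketch of what I tried (assuming a SparkSession handle named spark; the setting has to be in place before the deduplication runs, since dropDuplicates shuffles its output into spark.sql.shuffle.partitions partitions):

```python
# Assumed setup: `spark` is an existing SparkSession and `df` is the
# DataFrame that currently has 800 partitions.

# spark.sql.shuffle.partitions controls how many partitions a shuffle
# (such as the one dropDuplicates performs) produces; the default is 200.
spark.conf.set("spark.sql.shuffle.partitions", "800")

df = df.dropDuplicates()
print(df.rdd.getNumPartitions())
```

Even with this configuration set before calling dropDuplicates, the resulting DataFrame still reports 200 partitions.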