I am getting somewhat unexpected results with df.drop_duplicates().
df2 = df.dropDuplicates()
print(df2.count())
# prints 424527
print(df.count())
# prints 424510
I do not understand why the count is higher after dropping duplicates. I can think of the following possibilities (naively), but I don't believe any of them is true:
- These may be approximate counts, so I should just ignore them (very unlikely).
- A combination of using an EMR Studio workspace notebook with a very large cluster (12 m5.16xlarge instances), which may be causing something I don't understand (again, I am at a loss here; I don't know how it works behind the scenes).
- A change in the source data. Even though I am running both counts as consecutive statements, something changes the source dataset, and lazy evaluation re-reads it, so the counts differ. Again unlikely, as I am running these count statements consecutively.
Comments:
- Write out df to verify that this is not due to the upstream data source changing in between the evaluations. I agree with you; otherwise I find this confusing. For example:
df.sort('key_column').repartition(1).write.csv('before'); df2 = df.dropDuplicates(); df2.sort('key_column').write.csv('after')
- df.count() doesn't return an "approximate" value. It returns the actual count. The most likely explanation is that the source is changing.