I am getting somewhat unexpected results with df.drop_duplicates().

df2 = df.dropDuplicates()
print(df2.count())
# prints 424527

print(df.count())
# prints 424510 

I do not understand why the count is higher after dropping duplicates. I can think of the following possibilities (naively), but I don't believe any of them is true:

  1. These may be approximate counts, so I should just ignore the difference (very unlikely).
  2. A combination of using an EMR Studio Workspace notebook with a very large cluster (12 m5.16xlarge instances), which may be causing something I can't identify (again, I am at a loss here; I don't know how it works behind the scenes).
  3. A change in the source data. Even though I am running both counts as consecutive statements, something changes the source datasets, so lazy evaluation changes the counts. Again unlikely, as I am running these count statements consecutively.
  • To check 3, simply change the order: count df first, then df2. Commented Jan 7 at 17:06
  • I would cache df to verify that this is not due to the upstream data source changing in between the evaluations. Otherwise I agree with you; I find this confusing. Commented Jan 7 at 19:25
  • Agree with both comments above. Given that Spark evaluates lazily, you can't really expect your code (with its order and operations) to produce the results you're expecting. The best approach IMHO is to write to CSV files before and after and then compare: df.sort('key_column').repartition(1).write.csv('before'); df2 = df.dropDuplicates(); df2.sort('key_column').write.csv('after') Commented Jan 7 at 23:29
  • I will try the first two. @kashyap, why do you suggest sorting before writing? Commented Jan 9 at 2:18
  • df.count() doesn't return an "approximate" count. It returns the actual count. The most likely explanation is that the source is changing. Commented Jan 9 at 19:21
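
The caching suggestion from the comments could be sketched roughly as follows. This is only an illustration, not a confirmed fix: the helper name compare_counts_on_snapshot is hypothetical (it is not part of PySpark), and df is assumed to be the pyspark.sql.DataFrame from the question. Note that cache() is best effort; evicted partitions are recomputed from the source, so writing the data out and reading it back is a stronger guarantee of a fixed snapshot.

```python
def compare_counts_on_snapshot(df):
    """Count rows before and after dropDuplicates() on one cached DataFrame.

    Hypothetical helper: `df` is assumed to be a pyspark.sql.DataFrame.
    Returns (total_rows, distinct_rows).
    """
    snapshot = df.cache()      # only marks df for caching; nothing runs yet
    total = snapshot.count()   # first action: reads the source, fills the cache
    # Second action: computed from the cached rows, as long as no cached
    # partition was evicted and recomputed from the source.
    distinct = snapshot.dropDuplicates().count()
    snapshot.unpersist()       # release the cached data
    return total, distinct
```

Calling total, distinct = compare_counts_on_snapshot(df) and checking that distinct <= total would help separate a changing source from something else. If the inequality holds here but not in the original two statements, the source most likely changed between the two evaluations.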
