I am getting somewhat unexpected results with df.drop_duplicates().
df2 = df.dropDuplicates()
print(df2.count())
# prints 424527
print(df.count())
# prints 424510
I do not understand why the count is higher after dropping duplicates. I can think of the following possibilities (naively), but I don't believe any of them is true:
- These may be approximate counts, so I should just ignore them (very unlikely).
- A combination of using an EMR Studio workspace notebook with a very large cluster (12 m5.16xlarge instances), which may be causing something I don't understand (again, I am at a loss here; I don't know how it works behind the scenes).
- A change in the source data. Even though I am running both counts as consecutive statements, something changes the source dataset, and lazy evaluation re-reads it, so the counts differ. Again unlikely, as I am running these count statements consecutively.
Comments:
- Write out df to verify that this is not due to the upstream data source changing in between the evaluations. I agree with you; otherwise I find this confusing. For example:
df.sort('key_column').repartition(1).write.csv('before'); df2 = df.dropDuplicates(); df2.sort('key_column').write.csv('after')
- df.count() doesn't return an "approximate" value. It returns the actual count. The most likely explanation is that the source is changing.