Data:
112343 The data point was created on 1903.
112344 The data point was created on 1909.
112345 The data point was created on 1919.
112346 The data point was created on 1911.
112346 The data point was created on 1911-12.
112346 The data point was created on 1911-12.
112347 The data point was created on 1911.
112348 The data point was created on 1911.
Here the duplicates share the same id. I want the duplicates removed, keeping the row with the longest row[1] (as shown in the ideal output).
Here is what I tried:
import sys
import csv
import re
import string
df = csv.reader(open('fil.csv'), delimiter=',')
for r in df:
    dup = next(df)
    if r[0] == dup[0]:
        if r[1] < dup[1]:  # checking whether this text is longer than the previous one
            print(dup[0], dup[1])
    else:
        print(r[0], r[1])
But I am getting this output:
112343 The data point was created on 1903.
112346 The data point was created on 1911-12.
112346 The data point was created on 1911-12.
112346 The data point was created on 1911.
112348 The data point was created on 1911.
Some rows are missing!
The ideal output would be:
112343 The data point was created on 1903.
112344 The data point was created on 1909.
112345 The data point was created on 1919.
112346 The data point was created on 1911-12.
112347 The data point was created on 1911.
112348 The data point was created on 1911.
How can this be accomplished? What condition or keyword can I use? Or could I make two copies of the file and compare the rows between them to eliminate duplicates?
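One way to do this without pairwise comparisons is to keep a dict keyed by id and remember, for each id, the row with the longest text. This is a minimal sketch, assuming fil.csv is comma-delimited with the id in column 0 and the text in column 1 (the helper name keep_longest is mine, not from any library):

```python
import csv

def keep_longest(rows):
    """For each id (column 0), keep the row whose text (column 1) is longest."""
    best = {}
    for row in rows:
        key = row[0]
        # Replace the stored row only when this one's text is strictly longer.
        if key not in best or len(row[1]) > len(best[key][1]):
            best[key] = row
    return list(best.values())

# Usage (assumes the file layout described above):
# with open('fil.csv', newline='') as f:
#     for row in keep_longest(csv.reader(f, delimiter=',')):
#         print(row[0], row[1])
```

Note this compares lengths with len() rather than using <, since string comparison is lexicographic and would not reliably pick the longer text. It also avoids calling next() inside the loop, which is what makes the original version skip every other row.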
I also tried cat fil.csv | sort | uniq > fil_deduped_sorted.csv, but that doesn't help here: "112346 The data point was created on 1911." and "112346 The data point was created on 1911-12." aren't duplicate lines, so uniq keeps both.