
Data:

    112343  The data point was created on 1903.
    112344  The data point was created on 1909.
    112345  The data point was created on 1919.
    112346  The data point was created on 1911.
    112346  The data point was created on 1911-12.
    112346  The data point was created on 1911-12.
    112347  The data point was created on 1911.
    112348  The data point was created on 1911.

Here the duplicates share the same id. I want the duplicates removed, keeping only the row whose row[1] is longest (as shown in the ideal output).

Here is what I tried:

import sys
import csv
import re
import string

df = csv.reader(open('fil.csv'), delimiter = ',')

for r in df:
    dup = next(df)
    if r[0] == dup[0]:
        if r[1] < dup[1]: #I am checking whether this text is longer than the previous
            print dup[0], dup[1]
    else:
        print r[0], r[1]

But I am getting this output:

112343  The data point was created on 1903.
112346  The data point was created on 1911-12.
112346  The data point was created on 1911-12.
112346  The data point was created on 1911.
112348  The data point was created on 1911.

The rows are missing!

The ideal output would be

112343  The data point was created on 1903.
112344  The data point was created on 1909.
112345  The data point was created on 1919.
112346  The data point was created on 1911-12.
112347  The data point was created on 1911.
112348  The data point was created on 1911.

How can this be accomplished? What condition or keyword can I use? Or could I make two copies of the file and compare the rows between them to eliminate the duplicates?
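For reference, the rows go missing because `next(df)` inside the `for` loop advances the same iterator the loop is using, so every second row is consumed silently. A minimal demonstration of that skipping behavior (Python 3 syntax, with an in-memory reader standing in for the file):

```python
import csv
import io

# Three rows; the loop below consumes two per iteration.
rows = io.StringIO("a,1\nb,2\nc,3\n")
reader = csv.reader(rows)

seen = []
for r in reader:
    seen.append(r)
    try:
        next(reader)  # also advances the iterator the loop uses
    except StopIteration:
        pass

print(seen)  # only every other row survives
```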

  • cat fil.csv | sort | uniq > fil_deduped_sorted.csv Commented Oct 29, 2015 at 5:52
  • @Alik that still has two rows starting with 112346 Commented Oct 29, 2015 at 5:54
  • @user1717828 you are right. The OP should clearly specify what they mean by removing duplicate lines, since 112346 The data point was created on 1911. and 112346 The data point was created on 1911-12. aren't duplicates Commented Oct 29, 2015 at 5:57
  • How do I merge rows whose starting number is the same? Commented Oct 29, 2015 at 5:59
  • @Alik I have specified what the duplicates are. Thanks Commented Oct 29, 2015 at 6:00

7 Answers


My attempt:

import csv
import collections

csv_input = """    112343,  The data point was created on 1903.
    112344,  The data point was created on 1909.
    112345,  The data point was created on 1919.
    112346,  The data point was created on 1911.
    112346,  The data point was created on 1911-12.
    112346,  The data point was created on 1911-12.
    112347,  The data point was created on 1911.
    112348,  The data point was created on 1911."""

reader = csv.reader(csv_input.split('\n'))    

result = collections.OrderedDict()
for row_id, data in reader:
    if len(result.get(row_id, ''))<len(data):
        result[row_id] = data

for row_id, data in result.items():
    print "{},{}".format(row_id, data)

2 Comments

This works. But you are supplying csv_input as a string and then using csv.reader, whereas I have a csv file, not a string! Is it possible to read the file as a string?
@GanapathyMani yes, it is possible to read a file into a string
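For example, a file can be fed to the same logic directly, since `csv.reader` accepts any iterable of lines, an open file object included. A self-contained sketch (Python 3 syntax; the filename `fil.csv` is the one from the question, and the sample rows are written out first so the example runs on its own):

```python
import csv
import collections

# Create a small sample file so the sketch is self-contained.
with open('fil.csv', 'w') as f:
    f.write('112346,The data point was created on 1911.\n'
            '112346,The data point was created on 1911-12.\n')

result = collections.OrderedDict()
with open('fil.csv', newline='') as f:
    for row_id, data in csv.reader(f):
        # keep the longest second column seen for each id
        if len(result.get(row_id, '')) < len(data):
            result[row_id] = data

for row_id, data in result.items():
    print('{},{}'.format(row_id, data))
```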

Try this:

import csv

some_dict = {}
file_name = "sample.csv"
with open(file_name) as f:
    data = csv.reader(f,delimiter = ' ')
    for row in data:
        key = row.pop(0)
        if key in some_dict:
            if len(row[0])>len(some_dict[key]):
                some_dict[key] = row.pop(0)
        else:
            some_dict[key] = row.pop(0)

for key,value in some_dict.iteritems():
    print key,value

5 Comments

I am getting: AttributeError: 'str' object has no attribute 'pop'
This won't work. Also, dictionaries in Python do not preserve order
@Ganapathy change the delimiter to actual separator used in your file (',', '\t', ';', etc..) and this works. If you use the wrong one it won't separate the columns so you get one single string.
@Alik Yes this does not preserve order but can you explain why this doesn't work ??
Ah! It prints 112343,The data 112345,The data 112348,The data 112344,The data 112346,The data 112347,The data
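If insertion order matters, one possible fix for the approach above is to swap the plain dict for `collections.OrderedDict` and compare lengths on every row (a sketch in Python 3 syntax, with an in-memory sample in place of `sample.csv`):

```python
import csv
import collections
import io

sample = io.StringIO('112346,first\n'
                     '112344,second\n'
                     '112346,a longer value\n')

some_dict = collections.OrderedDict()  # remembers insertion order
for row in csv.reader(sample):
    key, value = row[0], row[1]
    # keep the longest value seen for each key
    if key not in some_dict or len(value) > len(some_dict[key]):
        some_dict[key] = value

for key, value in some_dict.items():
    print(key, value)
```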

My solution would be-

import csv
unqkey = set()
data = []

with open(r"C:\data.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        unqkey.add(row[0])
        data.append(row)


unqkey = sorted(list(unqkey))

for i in unqkey:
    r=[]
    for j in data:
        if j[0]==i:
            r.append(' '.join(j))
            r.sort(key=len)
    print r[-1]

it prints-

112343  The data point was created on 1903.
112344  The data point was created on 1909.
112345  The data point was created on 1919.
112346  The data point was created on 1911-12.
112347  The data point was created on 1911.
112348  The data point was created on 1911.

2 Comments

your solution for some reason messes up the order when I run it over the CSV.
Ah! Would you share more of the csv content, or the error message?

I'm working on the (not unreasonable) assumption that your data is always sorted on id.

The initialization

from sys import maxint
prev_id = maxint
longest = ""
data = open('myfile.dat')

The loop on data

for row in data:
    curr_id = int(row.split()[0])
    if prev_id < curr_id:
        print longest
        longest = row
    elif len(row)>len(longest): 
        longest = row
    prev_id = curr_id
# here we have still one row to  output
print longest

The relative merit of this answer is its memory efficiency, since rows are processed one at a time. Of course this efficiency depends on the ordering I assumed in the data file!

2 Comments

getting this error: AttributeError: 'list' object has no attribute 'split'. I am using CSV file.
In my code there is a single instance of split, in row.split(), and row is a string, because a string is what you get when iterating over a file object. If you need to treat your data as a CSV (TSV maybe?) file you have to adapt my code, but be warned: for what you asked (removing duplicates), the use of the csv module is entirely gratuitous.
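If the data really does need csv parsing, one possible adaptation of the streaming idea (a sketch in Python 3 syntax, with an in-memory sample in place of `myfile.dat`, still assuming rows are sorted by id):

```python
import csv
import io

data = io.StringIO('112346,created on 1911.\n'
                   '112346,created on 1911-12.\n'
                   '112347,created on 1911.\n')

prev_id = None
longest = None
kept = []
for row in csv.reader(data):
    curr_id = int(row[0])
    if prev_id is not None and prev_id < curr_id:
        kept.append(longest)   # id changed: emit the best row of the group
        longest = row
    elif longest is None or len(row[1]) > len(longest[1]):
        longest = row          # same id, but a longer second column
    prev_id = curr_id
if longest is not None:
    kept.append(longest)       # flush the final group

for row in kept:
    print(','.join(row))
```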

This is how I removed the duplicates.

First, I removed duplicates through Excel. But there were still some other duplicates with different column sizes (same id but different lengths for row[1]). Within each duplicated pair of rows, I want the row with the larger second column (larger len(row[1])). Here is what I did:

import csv
import sys
dfo = open('fil.csv', 'rU')
df = csv.reader(dfo, delimiter = ',')

temp = ''
temp1 = ''

for r in reversed(list(df)):
    if r[0] == temp:
        continue
    elif len(r[1]) > len(temp1):
            print r[0] + '|' + r[1] + '|' + r[2] + '|' + r[3]
            #I used | for the csv separation. 
    else:
        print r[0] + '|' + r[1] + '|' + r[2] + '|' + r[3]

    temp = r[0]
    temp1 = r[1]

This took care of the duplicates: I simply skip any duplicate row whose r[1] is shorter. The loop prints the list in reverse, so I saved the output to a csv file and printed that file in reverse again, restoring the original order. That solved the problem.
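The save-and-reverse round trip can also be done in memory, avoiding the intermediate file. A sketch of the same idea (Python 3 syntax, reduced to two columns, in-memory sample; like the original, it assumes the longest r[1] comes last within each group of duplicates):

```python
import csv
import io

sample = io.StringIO('112346,created on 1911.\n'
                     '112346,created on 1911-12.\n'
                     '112347,created on 1911.\n')

rows = list(csv.reader(sample))
out = []
temp = None
for r in reversed(rows):   # walk backwards, as in the answer
    if r[0] == temp:
        continue           # skip the earlier, shorter duplicate
    out.append(r)
    temp = r[0]
out.reverse()              # restore the original order

for r in out:
    print('|'.join(r))
```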

Comments


How to remove duplicate rows from CSV?

Open the CSV in Excel. Excel has a built-in tool that allows you to remove duplicates. Follow this tutorial for more info.

1 Comment

Assuming the question is asking for the best way to remove duplicate rows from a CSV, and not restricted to a programmatic way. Reason I say this is because a friend of mine (a researcher in biology) once asked me a similar question - turns out he was learning programming so that he wouldn't have to manually identify and remove duplicate entries from massive excel sheets. He didn't realize excel had this feature built in :).

The reason why your code skips lines is the next function: each call consumes a row the for loop will never see. In my solution, I first read all lines into a list, then sort the list by the second column; when the first column value repeats, we keep only the first row and skip the others.

import csv
from operator import itemgetter
with open('file.csv', 'rb') as f:
    reader = csv.reader(f)
    your_list = list(reader)

your_list.sort(key=itemgetter(1)) # sorted by the second column
result = [your_list[0]] # to store the filtered results
for index in range(1,len(your_list)):
    if your_list[index] != your_list[index-1][0]:
        result.append(your_list[index])
print result

4 Comments

your_list[index-1][0] - you have a flaw here. What if input file contains lines where first column is the same? I believe your code will produce an empty result in this case
Yes, Alik is right. Plus my rows are already sorted as given in the data!
@Hooting it is better to append first value to result outside of the loop and use start parameter for enumerate.
getting error NameError: name 'value' is not defined
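Putting the comments together, one possible corrected sketch (Python 3 syntax, with an in-memory sample in place of `file.csv`): sort so that within each id the longest second column comes first, then keep only the first row of each id group. Comparing row[0] against the previous row's row[0] fixes the list-versus-string comparison noted above.

```python
import csv
import io

sample = io.StringIO('112346,created on 1911.\n'
                     '112346,created on 1911-12.\n'
                     '112344,created on 1909.\n')

your_list = list(csv.reader(sample))
# group by id; within a group, the longest second column sorts first
your_list.sort(key=lambda row: (row[0], -len(row[1])))

result = []
for index, row in enumerate(your_list):
    # keep only the first (longest) row of each id group
    if index == 0 or row[0] != your_list[index - 1][0]:
        result.append(row)

print(result)
```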
