
Data:

    112343  The data point was created on 1903.
    112344  The data point was created on 1909.
    112345  The data point was created on 1919.
    112346  The data point was created on 1911.
    112346  The data point was created on 1911-12.
    112346  The data point was created on 1911-12.
    112347  The data point was created on 1911.
    112348  The data point was created on 1911.

Here the duplicates share the same id. I want the duplicates removed, keeping only the row whose row[1] is longest (as shown in the ideal output).

Here is what I tried:

import sys
import csv
import re
import string

df = csv.reader(open('fil.csv'), delimiter = ',')

for r in df:
    dup = next(df)
    if r[0] == dup[0]:
        if r[1] < dup[1]: #I am checking whether this text is longer than the previous
            print dup[0], dup[1]
    else:
        print r[0], r[1]

But I am getting this output:

112343  The data point was created on 1903.
112346  The data point was created on 1911-12.
112346  The data point was created on 1911-12.
112346  The data point was created on 1911.
112348  The data point was created on 1911.

The rows are missing!

The ideal output would be

112343  The data point was created on 1903.
112344  The data point was created on 1909.
112345  The data point was created on 1919.
112346  The data point was created on 1911-12.
112347  The data point was created on 1911.
112348  The data point was created on 1911.

How can this be accomplished? What condition or keyword can I use? Or could I make two copies of the file and compare the rows between them to eliminate the duplicates?
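For reference, the rows go missing because `next(df)` inside the `for` loop advances the same iterator the loop is using, so every second row is consumed silently. A minimal demonstration of that skipping behavior (Python 3 syntax, with an in-memory reader standing in for the file):

```python
import csv
import io

# Three rows; the loop below consumes two per iteration.
rows = io.StringIO("a,1\nb,2\nc,3\n")
reader = csv.reader(rows)

seen = []
for r in reader:
    seen.append(r)
    try:
        next(reader)  # also advances the iterator the loop uses
    except StopIteration:
        pass

print(seen)  # only every other row survives
```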

  • cat fil.csv | sort | uniq > fil_deduped_sorted.csv Commented Oct 29, 2015 at 5:52
  • @Alik that still has two rows starting with 112346 Commented Oct 29, 2015 at 5:54
  • @user1717828 you are right. The OP should clearly specify what they mean by removing duplicate lines, since 112346 The data point was created on 1911. and 112346 The data point was created on 1911-12. aren't duplicates Commented Oct 29, 2015 at 5:57
  • How do I merge rows whose starting number is the same? Commented Oct 29, 2015 at 5:59
  • @Alik I have specified what the duplicates are. Thanks Commented Oct 29, 2015 at 6:00

7 Answers


My attempt:

import csv
import collections

csv_input = """    112343,  The data point was created on 1903.
    112344,  The data point was created on 1909.
    112345,  The data point was created on 1919.
    112346,  The data point was created on 1911.
    112346,  The data point was created on 1911-12.
    112346,  The data point was created on 1911-12.
    112347,  The data point was created on 1911.
    112348,  The data point was created on 1911."""

reader = csv.reader(csv_input.split('\n'))    

result = collections.OrderedDict()
for row_id, data in reader:
    if len(result.get(row_id, ''))<len(data):
        result[row_id] = data

for row_id, data in result.items():
    print "{},{}".format(row_id, data)

2 Comments

This works. But you are supplying csv_input as a string and then using csv.reader, whereas I have a csv file, not a string! Is it possible to read the file as a string?
@GanapathyMani yes, it is possible to read a file into a string
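For example, a file can be fed to the same logic directly, since `csv.reader` accepts any iterable of lines, an open file object included. A self-contained sketch (Python 3 syntax; the filename `fil.csv` is the one from the question, and the sample rows are written out first so the example runs on its own):

```python
import csv
import collections

# Create a small sample file so the sketch is self-contained.
with open('fil.csv', 'w') as f:
    f.write('112346,The data point was created on 1911.\n'
            '112346,The data point was created on 1911-12.\n')

result = collections.OrderedDict()
with open('fil.csv', newline='') as f:
    for row_id, data in csv.reader(f):
        # keep the longest second column seen for each id
        if len(result.get(row_id, '')) < len(data):
            result[row_id] = data

for row_id, data in result.items():
    print('{},{}'.format(row_id, data))
```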

Try this:

import csv

some_dict = {}
file_name = "sample.csv"
with open(file_name) as f:
    data = csv.reader(f,delimiter = ' ')
    for row in data:
        key = row.pop(0)
        if key in some_dict:
            if len(row[0])>len(some_dict[key]):
                some_dict[key] = row.pop(0)
        else:
            some_dict[key] = row.pop(0)

for key,value in some_dict.iteritems():
    print key,value

5 Comments

I am getting: AttributeError: 'str' object has no attribute 'pop'
This won't work. Also, dictionaries in Python do not preserve order
@Ganapathy change the delimiter to actual separator used in your file (',', '\t', ';', etc..) and this works. If you use the wrong one it won't separate the columns so you get one single string.
@Alik Yes this does not preserve order but can you explain why this doesn't work ??
Ah! It prints 112343,The data 112345,The data 112348,The data 112344,The data 112346,The data 112347,The data
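If insertion order matters, one possible fix for the approach above is to swap the plain dict for `collections.OrderedDict` and compare lengths on every row (a sketch in Python 3 syntax, with an in-memory sample in place of `sample.csv`):

```python
import csv
import collections
import io

sample = io.StringIO('112346,first\n'
                     '112344,second\n'
                     '112346,a longer value\n')

some_dict = collections.OrderedDict()  # remembers insertion order
for row in csv.reader(sample):
    key, value = row[0], row[1]
    # keep the longest value seen for each key
    if key not in some_dict or len(value) > len(some_dict[key]):
        some_dict[key] = value

for key, value in some_dict.items():
    print(key, value)
```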

My solution would be-

import csv
unqkey = set()
data = []

with open(r"C:\data.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        unqkey.add(row[0])
        data.append(row)


unqkey = sorted(list(unqkey))

for i in unqkey:
    r=[]
    for j in data:
        if j[0]==i:
            r.append(' '.join(j))
            r.sort(key=len)
    print r[-1]

it prints-

112343  The data point was created on 1903.
112344  The data point was created on 1909.
112345  The data point was created on 1919.
112346  The data point was created on 1911-12.
112347  The data point was created on 1911.
112348  The data point was created on 1911.

2 Comments

your solution for some reason messes up the order when I run it over the CSV.
Ah! Would you share more of the csv content, or the error message?

I'm working on the (not unreasonable) assumption that your data is always sorted on id.

The initialization

from sys import maxint
prev_id = maxint
longest = ""
data = open('myfile.dat')

The loop on data

for row in data:
    curr_id = int(row.split()[0])
    if prev_id < curr_id:
        print longest
        longest = row
    elif len(row)>len(longest): 
        longest = row
    prev_id = curr_id
# here we have still one row to  output
print longest

The relative merit of this answer is its memory efficiency, since rows are processed one at a time. Of course this efficiency depends on the ordering I assumed in the data file!

2 Comments

getting this error: AttributeError: 'list' object has no attribute 'split'. I am using CSV file.
In my code there is a single instance of split, in row.split(), and row is a string, because a string is what you get when iterating over a file object. If you need to treat your data as a CSV (TSV maybe?) file you have to adapt my code, but be warned: for what you asked (removing duplicates), the use of the csv module is entirely gratuitous.
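If the data really does need csv parsing, one possible adaptation of the streaming idea (a sketch in Python 3 syntax, with an in-memory sample in place of `myfile.dat`, still assuming rows are sorted by id):

```python
import csv
import io

data = io.StringIO('112346,created on 1911.\n'
                   '112346,created on 1911-12.\n'
                   '112347,created on 1911.\n')

prev_id = None
longest = None
kept = []
for row in csv.reader(data):
    curr_id = int(row[0])
    if prev_id is not None and prev_id < curr_id:
        kept.append(longest)   # id changed: emit the best row of the group
        longest = row
    elif longest is None or len(row[1]) > len(longest[1]):
        longest = row          # same id, but a longer second column
    prev_id = curr_id
if longest is not None:
    kept.append(longest)       # flush the final group

for row in kept:
    print(','.join(row))
```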

This is how I removed the duplicates.

First, I removed duplicates through Excel. But there were still some other duplicates with different column sizes (same id but different lengths for row[1]). Within each duplicated pair of rows, I want the row with the larger second column (larger len(row[1])). Here is what I did:

import csv
import sys
dfo = open('fil.csv', 'rU')
df = csv.reader(dfo, delimiter = ',')

temp = ''
temp1 = ''

for r in reversed(list(df)):
    if r[0] == temp:
        continue
    elif len(r[1]) > len(temp1):
            print r[0] + '|' + r[1] + '|' + r[2] + '|' + r[3]
            #I used | for the csv separation. 
    else:
        print r[0] + '|' + r[1] + '|' + r[2] + '|' + r[3]

    temp = r[0]
    temp1 = r[1]

This took care of the duplicates: I simply skip any duplicate row whose r[1] is shorter. The loop prints the list in reverse, so I saved the output to a csv file and printed that file in reverse again, restoring the original order. That solved the problem.
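The save-and-reverse round trip can also be done in memory, avoiding the intermediate file. A sketch of the same idea (Python 3 syntax, reduced to two columns, in-memory sample; like the original, it assumes the longest r[1] comes last within each group of duplicates):

```python
import csv
import io

sample = io.StringIO('112346,created on 1911.\n'
                     '112346,created on 1911-12.\n'
                     '112347,created on 1911.\n')

rows = list(csv.reader(sample))
out = []
temp = None
for r in reversed(rows):   # walk backwards, as in the answer
    if r[0] == temp:
        continue           # skip the earlier, shorter duplicate
    out.append(r)
    temp = r[0]
out.reverse()              # restore the original order

for r in out:
    print('|'.join(r))
```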

Comments


How to remove duplicate rows from CSV?

Open the CSV in Excel. Excel has a built-in tool that allows you to remove duplicates. Follow this tutorial for more info.

1 Comment

Assuming the question is asking for the best way to remove duplicate rows from a CSV, and not restricted to a programmatic way. Reason I say this is because a friend of mine (a researcher in biology) once asked me a similar question - turns out he was learning programming so that he wouldn't have to manually identify and remove duplicate entries from massive excel sheets. He didn't realize excel had this feature built in :).

The reason why your code skips lines is the next function: each call consumes a row the for loop will never see. In my solution, I first read all lines into a list, then sort the list by the second column; when the first column value repeats, we keep only the first row and skip the others.

import csv
from operator import itemgetter
with open('file.csv', 'rb') as f:
    reader = csv.reader(f)
    your_list = list(reader)

your_list.sort(key=itemgetter(1)) # sorted by the second column
result = [your_list[0]] # to store the filtered results
for index in range(1,len(your_list)):
    if your_list[index] != your_list[index-1][0]:
        result.append(your_list[index])
print result

4 Comments

your_list[index-1][0] - you have a flaw here. What if input file contains lines where first column is the same? I believe your code will produce an empty result in this case
Yes, Alik is right. Plus my rows are already sorted as given in the data!
@Hooting it is better to append first value to result outside of the loop and use start parameter for enumerate.
getting error NameError: name 'value' is not defined
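Putting the comments together, one possible corrected sketch (Python 3 syntax, with an in-memory sample in place of `file.csv`): sort so that within each id the longest second column comes first, then keep only the first row of each id group. Comparing row[0] against the previous row's row[0] fixes the list-versus-string comparison noted above.

```python
import csv
import io

sample = io.StringIO('112346,created on 1911.\n'
                     '112346,created on 1911-12.\n'
                     '112344,created on 1909.\n')

your_list = list(csv.reader(sample))
# group by id; within a group, the longest second column sorts first
your_list.sort(key=lambda row: (row[0], -len(row[1])))

result = []
for index, row in enumerate(your_list):
    # keep only the first (longest) row of each id group
    if index == 0 or row[0] != your_list[index - 1][0]:
        result.append(row)

print(result)
```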
