
I have a csv file like this :

column1    column2

john       kerry
adam       stephenson
ashley     hudson
john       kerry
etc..

I want to remove duplicates from this file, to get only :

column1    column2

john       kerry
adam       stephenson
ashley     hudson

I wrote this script that removes duplicates based on last names only, but I need to remove duplicates based on last names AND first names.

import csv

with open('myfilewithduplicates.csv', 'r', newline='') as infile, \
     open('myfilewithoutduplicates.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter=',')
    writer = csv.writer(outfile, delimiter=',')

    lastnames = set()
    for row in reader:
        if row[1] not in lastnames:  # only the last name is checked
            writer.writerow(row)
            lastnames.add(row[1])
  • "I wrote this script that removes duplicates based on names, but I need to remove duplicates based on name AND firstname." I'm confused here. When you say name do you mean first name, last name or a concatenation? Your script only works on last names. Commented Oct 12, 2012 at 1:34
  • Sorry for being unclear, I want to remove duplicates based on lastnames (column2) and firstnames (column1) Commented Oct 12, 2012 at 1:40

3 Answers


You're really close. Use both columns together as the set entry:

entries = set()

for row in reader:
    key = (row[0], row[1])  # instead of just the last name

    if key not in entries:
        writer.writerow(row)
        entries.add(key)
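For reference, this approach might look like the following end to end (a runnable sketch: the sample rows are invented to mirror the question, and `newline=''` is the usual Python 3 `csv` convention):

```python
import csv

# Invented sample input matching the question's data.
rows = [
    ['column1', 'column2'],
    ['john', 'kerry'],
    ['adam', 'stephenson'],
    ['ashley', 'hudson'],
    ['john', 'kerry'],  # duplicate row, should be dropped
]
with open('myfilewithduplicates.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)

# De-duplicate on (first name, last name) together.
with open('myfilewithduplicates.csv', newline='') as infile, \
     open('myfilewithoutduplicates.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    entries = set()
    for row in reader:
        key = (row[0], row[1])
        if key not in entries:
            writer.writerow(row)
            entries.add(key)
```

The header row survives automatically because `('column1', 'column2')` is itself a unique key.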



You can now use the .drop_duplicates method in pandas. I would do the following:

import pandas as pd

toclean = pd.read_csv('myfilewithduplicates.csv')
deduped = toclean.drop_duplicates(subset=['column1', 'column2'])
deduped.to_csv('myfilewithoutduplicates.csv', index=False)
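As a small sanity check of `drop_duplicates` (sample data invented to mirror the question; `subset=` names the columns that define a duplicate, and `keep='first'` is the default):

```python
import pandas as pd

# Invented sample mirroring the question's CSV.
df = pd.DataFrame({
    'column1': ['john', 'adam', 'ashley', 'john'],
    'column2': ['kerry', 'stephenson', 'hudson', 'kerry'],
})

# Rows are duplicates when BOTH columns match; the first occurrence is kept.
deduped = df.drop_duplicates(subset=['column1', 'column2'])
print(len(deduped))  # 3 rows remain
```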



A quick way would be to create a unique set of rows using the following technique (adapted from @CedricJulien in this post). You lose the DictWriter benefit of having the column names stored in each row, but it should work for your case:

>>> import csv
>>> with open('testcsv1.csv', 'r', newline='') as f:
...     reader = csv.reader(f)
...     uniq = [list(tup) for tup in set(tuple(row) for row in reader)]
...
>>> with open('nodupes.csv', 'w', newline='') as f:
...     writer = csv.writer(f)
...     for row in uniq:
...         writer.writerow(row)
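One caveat: `set()` does not preserve row order, so the output rows can come back shuffled. On Python 3.7+, a `dict.fromkeys` variant (a sketch, not part of the original answer) keeps first-occurrence order:

```python
# dicts preserve insertion order (Python 3.7+), so dict.fromkeys gives an
# order-preserving de-duplication; data here is an invented sample.
rows = [
    ['john', 'kerry'],
    ['adam', 'stephenson'],
    ['ashley', 'hudson'],
    ['john', 'kerry'],  # duplicate, dropped
]
uniq = [list(tup) for tup in dict.fromkeys(tuple(row) for row in rows)]
print(uniq)  # first occurrence of each row, in original order
```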

The version below applies the same de-duplication technique (duplicates defined as the same first and last name) but keeps the column names by using the DictReader/DictWriter classes:

>>> import csv
>>> with open('testcsv1.csv', 'r', newline='') as f:
...     reader = csv.DictReader(f)
...     rows = [row for row in reader]
...
>>> uniq = [dict(tup) for tup in set(tuple(person.items()) for person in rows)]
>>> with open('nodupes.csv', 'w', newline='') as f:
...     writer = csv.DictWriter(f, fieldnames=['column1', 'column2'])
...     writer.writeheader()  # simpler than writing the header row by hand
...     for row in uniq:
...         writer.writerow(row)
...

