How do you remove duplicates from a list in Python whilst preserving order and length?

Question

I want to remove duplicates from a list and, every time a duplicate is removed, insert an empty item in its place, so that the list keeps its original order and length.

I have code for removing duplicates. It deliberately skips empty cells, so they are never treated as duplicates:

import csv

# Create (or truncate) the output file
new_file = open('addr_list_corrected.csv', 'w')
new_file.close()

with open('addr_list.csv', 'r') as addr_list:
    csv_reader = csv.reader(addr_list, delimiter=',')
    for row in csv_reader:

        print(row)
        print("##########################")
        seen = set()
        seen_add = seen.add
        # An empty cell evaluates to false, so empty cells are always kept
        new_row = [cell for cell in row if not (cell and cell in seen or seen_add(cell))]
        print(new_row)

        with open('addr_list_corrected.csv', 'a') as addr_list_corrected:
            csv_writer = csv.writer(addr_list_corrected, delimiter=',')
            csv_writer.writerow(new_row)

But I need to replace every removed item with an empty string.

Possible duplicate of "How do you remove duplicates from a list in Python whilst preserving order?"
– Paul Hankin, Commented Mar 23, 2015 at 2:48
I've voted to close this as a dupe. The duplicate's answer doesn't "insert empty items", but it's a trivial modification to do so.
– Paul Hankin, Commented Mar 23, 2015 at 2:49
Take a look at the unique_everseen function in the Itertools Recipes.
– wwii, Commented Mar 23, 2015 at 4:00
@Anonymous Yes, it is probably a trivial modification, but I do not seem to be able to do it ;)
– Vic152, Commented Mar 23, 2015 at 15:47

Pedro Werneck · Accepted Answer · 2015-03-23 02:12:14Z

3

I would do that with an iterator. Something like this:

def dedup(seq):
    seen = set()
    for v in seq:
        yield '' if v in seen else v
        seen.add(v)
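Because dedup is a generator, it yields values lazily. As a minimal usage sketch (the sample row is borrowed from the other answers here, not from the asker's data), wrapping the call in list() gives a list of the same length as the input:

```python
def dedup(seq):
    """Yield each value once; repeats are replaced by ''."""
    seen = set()
    for v in seq:
        yield '' if v in seen else v
        seen.add(v)

row = ["to", "be", "or", "not", "to", "be"]
new_row = list(dedup(row))  # materialise the generator
print(new_row)  # ['to', 'be', 'or', 'not', '', '']
```

The order is preserved and len(new_row) equals len(row), which is what the question asks for.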

2 Comments

Vic152 Over a year ago

You seem to use a for loop; would a list comprehension be any better?

Pedro Werneck Over a year ago

Better in what sense? Faster? Probably.

srgerg · Accepted Answer · 2015-03-23 04:15:02Z

Edit: reversed the logic to make the meaning clearer.

Another alternative would be to do something like this:

seen = dict()
seen_setdefault = seen.setdefault
new_row = ["" if cell in seen else seen_setdefault(cell, cell) for cell in row]

To give an example:

>>> row = ["to", "be", "or", "not", "to", "be"]
>>> seen = dict()
>>> seen_setdefault = seen.setdefault
>>> new_row = ["" if cell in seen else seen_setdefault(cell, cell) for cell in row]
>>> new_row
['to', 'be', 'or', 'not', '', '']

Edit 2: Out of curiosity I ran a quick test to see which approach was fastest:

>>> from random import randint
>>> from statistics import mean
>>> from timeit import repeat
>>>
>>> def standard(seq):
...     """Trivial modification to standard method for removing duplicates."""
...     seen = set()
...     seen_add = seen.add
...     return ["" if x in seen or seen_add(x) else x for x in seq]
...
>>> def dedup(seq):
...     seen = set()
...     for v in seq:
...         yield '' if v in seen else v
...         seen.add(v)
...
>>> def pedro(seq):
...     """Pedro's iterator based approach to removing duplicates."""
...     my_dedup = dedup
...     return [x for x in my_dedup(seq)]
...
>>> def srgerg(seq):
...     """Srgerg's dict based approach to removing duplicates."""
...     seen = dict()
...     seen_setdefault = seen.setdefault
...     return ["" if cell in seen else seen_setdefault(cell, cell) for cell in seq]
...
>>> data = [randint(0, 10000) for x in range(100000)]
>>>
>>> mean(repeat("standard(data)", "from __main__ import data, standard", number=100))
1.2130275770426708
>>> mean(repeat("pedro(data)", "from __main__ import data, pedro", number=100))
3.1519048346103555
>>> mean(repeat("srgerg(data)", "from __main__ import data, srgerg", number=100))
1.2611971098676882

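As the results show, a relatively simple modification of the standard approach described in the other Stack Overflow question (the standard function above) is the fastest. The dict-based version is only marginally slower, while the generator-based version takes roughly 2.6 times as long.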
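The condition in the standard function can look cryptic at first. As a small sketch of why it works (Python 3 assumed; the sample values are only illustrative): set.add() always returns None, so the right-hand side of the or records the value as a side effect without ever making the condition true.

```python
seen = set()
seen_add = seen.add

# set.add() always returns None, which is falsy.
assert seen_add("to") is None

# First occurrence: "be" is not in seen, so seen_add("be") runs,
# records the value, and the whole condition is falsy -> keep "be".
first = "" if "be" in seen or seen_add("be") else "be"

# Second occurrence: "be" in seen is True, `or` short-circuits,
# and the condition is truthy -> emit "".
second = "" if "be" in seen or seen_add("be") else "be"

print(repr(first), repr(second))  # 'be' ''
```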

Hi guys! Thanks a lot. I am new to Python and did not quite understand comprehensions. The trivial modification did the trick:

seen = set()
seen_add = seen.add
new_row = ["" if x in seen or seen_add(x) else x for x in row]

@srgerg Thanks a lot!
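For completeness, here is a minimal sketch of how that one-line modification might slot into the CSV processing from the question. The helper names blank_duplicates and correct_csv are hypothetical, Python 3 is assumed, and in-memory io.StringIO buffers stand in for addr_list.csv and addr_list_corrected.csv so that the snippet runs on its own.

```python
import csv
import io

def blank_duplicates(row):
    """Return a copy of row with repeated cells replaced by ''."""
    seen = set()
    seen_add = seen.add
    # seen_add() returns None, so the `or` branch only records x.
    return ["" if x in seen or seen_add(x) else x for x in row]

def correct_csv(source, target):
    """Copy CSV rows from source to target, blanking duplicates per row."""
    writer = csv.writer(target, delimiter=',')
    for row in csv.reader(source, delimiter=','):
        writer.writerow(blank_duplicates(row))

# In-memory stand-ins for the input and output files (an assumption).
source = io.StringIO("a,b,a,,c,b\nx,x,y\n")
target = io.StringIO()
correct_csv(source, target)
result = target.getvalue()
print(repr(result))  # 'a,b,,,c,\r\nx,,y\r\n'
```

Cells that are already empty stay empty, so every row keeps its original length. With real files in Python 3, the csv documentation recommends opening them with newline='' and keeping the output file open for the whole loop rather than reopening it for every row.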

Saksham Varma · Accepted Answer · 2015-03-23 03:32:25Z

0

You can use a set to keep track of seen items. Using the example list used above:

x = ['to', 'be', 'or', 'not', 'to', 'be']
seen = set()
for index, item in enumerate(x):
    if item in seen:
        x[index] = ''
    else:
        seen.add(item)
print(x)

sumit-sampang-rai · Accepted Answer · 2015-03-23 03:55:48Z

0

You can create a new list and, for each element of the old list, append the element if it is not yet present in the new list; otherwise append None.

oldList = [3, 1, 'a', 2, 4, 2, 'a', 5, 1, 3]
newList = []

for i in oldList:
    if i in newList:
        newList.append(None)
    else:
        newList.append(i)
print(newList)

Output:

[3, 1, 'a', 2, 4, None, None, 5, None, None]
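Checking membership with "in" on a list scans the whole list each time, so the loop above takes quadratic time on long inputs. As a hedged sketch of one alternative, the same output can be produced with a separate set for the membership test, assuming every item is hashable. The function name mask_duplicates and its placeholder parameter are hypothetical, introduced only for this illustration.

```python
def mask_duplicates(items, placeholder=None):
    """Return a new list where repeated items become placeholder."""
    seen = set()   # average O(1) membership test, unlike a list
    result = []
    for item in items:
        if item in seen:
            result.append(placeholder)
        else:
            seen.add(item)
            result.append(item)
    return result

old_list = [3, 1, 'a', 2, 4, 2, 'a', 5, 1, 3]
print(mask_duplicates(old_list))      # [3, 1, 'a', 2, 4, None, None, 5, None, None]
print(mask_duplicates(old_list, ''))  # [3, 1, 'a', 2, 4, '', '', 5, '', '']
```

Passing '' as the placeholder gives the empty-string behaviour requested in the question.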
