Remove duplicates in a list of lists based on the fourth item in each sublist

Question

I have a list of lists that looks like:

c = [['470', '4189.0', 'asdfgw', 'fds'],
     ['470', '4189.0', 'qwer', 'fds'],
     ['470', '4189.0', 'qwer', 'dsfs fdv'] 
      ...]

c has about 30,000 interior lists. What I'd like to do is eliminate duplicates based on the 4th item (index 3) in each interior list. So the list of lists above would look like:

c = [['470', '4189.0', 'asdfgw', 'fds'],['470', '4189.0', 'qwer', 'dsfs fdv'] ...]

Here is what I have so far:

d = [] #list that will contain condensed c
d.append(c[0]) #append first element, so I can compare lists
for bact in c: #c is my list of lists with 30,000 interior list
    for items in d:
        if bact[3] != items[3]:
            d.append(bact)

I think this should work, but it just runs and runs. I let it run for 30 minutes, then killed it. I don't think the program should take so long, so I'm guessing there is something wrong with my logic.

I have a feeling that creating a whole new list of lists is pretty stupid. Any help would be much appreciated, and please feel free to nitpick as I am learning. Also please correct my vocabulary if it is incorrect.

How do you know which of the duplicates needs to be removed?
– Tim, Commented Jun 18, 2014 at 21:52
Have you considered a separate set of the fourth elements already in the output? This would make the membership lookup much faster.
– jonrsharpe, Commented Jun 18, 2014 at 21:53
You're right. I guess I don't need the other information currently. Can you directly create a 'set', or do I iterate through my list of lists creating a list, and then call the set function?
– njBernstein, Commented Jun 18, 2014 at 21:57
@user3754225 you can add to the set as you go along; don't iterate over c twice!
– jonrsharpe, Commented Jun 18, 2014 at 22:00
You should check out pandas. If (as I assume) this is not going to be the last table-like operation you do on this data, pandas will be a very good investment. And your operation in pandas is just df.drop_duplicates('col_4')
– U2EF1, Commented Jun 18, 2014 at 22:03

timgeb · Accepted Answer · 2014-06-18 22:47:45Z

7

I'd do it like this:

seen = set()
cond = [x for x in c if x[3] not in seen and not seen.add(x[3])]

Explanation:

seen is a set which keeps track of already encountered fourth elements of each sublist. cond is the condensed list. In case x[3] (where x is a sublist in c) is not in seen, x will be added to cond and x[3] will be added to seen.

seen.add(x[3]) returns None, so not seen.add(x[3]) is always True, but that part is only evaluated if x[3] not in seen is True, since Python uses short-circuit evaluation. If the second condition gets evaluated, it always evaluates to True and has the side effect of adding x[3] to seen. Here's another example of what's happening (print returns None and has the "side effect" of printing something):

>>> False and not print('hi')
False
>>> True and not print('hi')
hi
True
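
To see the comprehension at work on the question's data, here is a minimal sketch. The three-row list c is copied from the question and only stands in for the full 30,000-row list.

```python
# Sample rows copied from the question (the real list has ~30,000 rows).
c = [['470', '4189.0', 'asdfgw', 'fds'],
     ['470', '4189.0', 'qwer', 'fds'],
     ['470', '4189.0', 'qwer', 'dsfs fdv']]

seen = set()
# Keep a row only the first time its fourth element (index 3) shows up.
# `not seen.add(...)` is always True; it is there to record the element.
cond = [x for x in c if x[3] not in seen and not seen.add(x[3])]

print(cond)
# [['470', '4189.0', 'asdfgw', 'fds'], ['470', '4189.0', 'qwer', 'dsfs fdv']]
```

The first row for each distinct fourth element is the one that survives, and the original order is preserved.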

edited Jun 18, 2014 at 22:47

answered Jun 18, 2014 at 21:56

2 Comments

njBernstein Over a year ago

I'm confused about the 'and not'. Is that the logical equivalent of "if blah then blah", or is it something else completely?

timgeb Over a year ago

@user3754225 I expanded the explanation a bit

U2EF1 · Answer · 2014-06-18 22:05:51Z

1

Use pandas. I assume you have better column names as well.

import pandas as pd

c = [['470', '4189.0', 'asdfgw', 'fds'],
     ['470', '4189.0', 'qwer', 'fds'],
     ['470', '4189.0', 'qwer', 'dsfs fdv']]
df = pd.DataFrame(c, columns=['col_1', 'col_2', 'col_3', 'col_4'])
# Keep only the first row for each distinct value in col_4.
df.drop_duplicates('col_4', inplace=True)
print(df)

  col_1   col_2   col_3     col_4
0   470  4189.0  asdfgw       fds
2   470  4189.0    qwer  dsfs fdv

answered Jun 18, 2014 at 22:05

jonrsharpe · Answer · 2014-06-18 22:25:36Z

You have a significant logic flaw in your current code:

for items in d:
    if bact[3] != items[3]:
        d.append(bact)

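This adds bact to d once for every item in d that doesn't match. Because d is being extended while you iterate over it, it can grow very quickly, which is likely why the program never seems to finish. For a minimal fix, you need to switch to: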

for items in d:
    if bact[3] == items[3]:
        break
else:
    d.append(bact)

to add bact exactly once, and only if no item in d matches it. I suspect this will mean your code runs in a more sensible time.
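
As a runnable sketch of that minimal fix, assuming the three sample rows from the question stand in for the full list: starting from an empty d also works, because the else branch runs when the inner loop has nothing to iterate over, so the separate d.append(c[0]) step is not needed.

```python
# Sample rows copied from the question.
c = [['470', '4189.0', 'asdfgw', 'fds'],
     ['470', '4189.0', 'qwer', 'fds'],
     ['470', '4189.0', 'qwer', 'dsfs fdv']]

d = []  # condensed list
for bact in c:
    for items in d:
        if bact[3] == items[3]:
            break  # a row with this fourth element is already in d
    else:
        # Runs only when the inner loop finished without hitting `break`,
        # i.e. no existing row in d shares bact's fourth element.
        d.append(bact)

print(d)
# [['470', '4189.0', 'asdfgw', 'fds'], ['470', '4189.0', 'qwer', 'dsfs fdv']]
```

This still scans d for every row of c, so the work grows with the number of rows times the number of distinct fourth elements.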

On top of that, one obvious performance improvement (a speed boost, albeit at the cost of some memory) would be to keep a set of the fourth elements you've seen so far. Lookups on a set use hashes, so the membership test (marked with a comment in the code below) will be much quicker than scanning a list.

d = []
seen = set()
for bact in c:
    if bact[3] not in seen: # membership test
        seen.add(bact[3])
        d.append(bact)
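
If you need this in more than one place, the same loop can be wrapped in a small helper. This is only a sketch: unique_by is a hypothetical name, not a standard-library function, and it assumes the key values are hashable (strings, as in the question, are fine).

```python
from operator import itemgetter


def unique_by(rows, key):
    """Return rows with duplicates removed; the first row per key(row) wins."""
    seen = set()
    result = []
    for row in rows:
        k = key(row)
        if k not in seen:  # hash-based membership test
            seen.add(k)
            result.append(row)
    return result


# Sample rows copied from the question.
c = [['470', '4189.0', 'asdfgw', 'fds'],
     ['470', '4189.0', 'qwer', 'fds'],
     ['470', '4189.0', 'qwer', 'dsfs fdv']]

d = unique_by(c, key=itemgetter(3))  # deduplicate on the fourth element
print(d)
# [['470', '4189.0', 'asdfgw', 'fds'], ['470', '4189.0', 'qwer', 'dsfs fdv']]
```

Passing the key as a function means the same helper can deduplicate on any column, for example itemgetter(2) for the third element.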
