
I have a list of xy coordinates as lists:

print(xy[0:10])

[[104.44464000013596, 21.900339999891116],
 [9.574480000151937, 0.32839999976022227],
 [9.932610000251373, 0.19092000005798582],
 [9.821009999711748, 0.26556000039374794],
 [9.877130000349268, -0.6701499997226392],
 [149.51198999973872, -28.469329999879562],
 [149.35872999988965, -28.684280000021943],
 [9.859010000211413, -0.03293000041912819],
 [9.38918000035676, -0.9979400000309511],
 [77.35380000007001, 32.926530000359264]]

Shown here are the first 10, but I have ~100,000 coordinate pairs in my list.

I would like to remove every list that appears more than once in this list (all copies of it, not just the extras), and to do so efficiently. As an easier-to-comprehend example, I would like a function remove_dupes that produces the following result:

a = [[1, 2], [3, 4], [5, 6], [1, 2], [1, 2], [8, 9], [3, 4]]
b = remove_dupes(a)
print(b)
# [[5, 6], [8, 9]]

Please note that order is important to preserve.

However, because I have such a large list, I find using the .count() method and iterating through the list to be too time-consuming. I also tried various tricks with set() and numpy's unique function.

Here is the fastest version I could come up with:

xy = [[x1, y1], [x2, y2], ..., [xn, yn]]

from collections import Counter
import numpy as np

def remove_dupes(xy):

    xy = [tuple(p) for p in xy]  # Convert each coordinate pair to a tuple so it is hashable

    p_hash = [hash(p) for p in xy]  # Hash each coordinate pair to a single value

    counts = Counter(p_hash)  # Count occurrences of each unique hash value

    p_remove = [key for (key, value) in counts.items() if value > 1]  # Keep the hashes with count > 1

    p_hash = np.array(p_hash)  # Cast from list to numpy array

    remove = np.zeros(len(xy), dtype=bool)  # Initialize the removal mask

    for p in p_remove:  # Mark every index where a non-unique hash appears // Most time-consuming portion
        remove[np.where(p == p_hash)[0]] = True

    xy = np.array(xy)  # Cast back to numpy array for indexing

    xy = xy[~remove, :]  # Keep only the non-duplicates

    return xy

This takes ~2 seconds for ~100,000 values (and longer if there are more duplicate pairs, triples, etc.). What bothers me is that functions like numpy.unique() return counts and indices in fractions of a second, yet I can't figure out how to conform their outputs to solve this problem. I looked through a couple dozen similar Stack Exchange posts, but found nothing that was both efficient and elegant. Does anyone have suggestions for a much more elegant way of solving this than I've presented here?
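
For concreteness, here is the kind of conformation I mean, as an untimed sketch (assuming NumPy 1.13+ for the axis=0 row-wise mode): map the per-row counts back through the inverse indices and keep the rows whose value occurs exactly once.

import numpy as np

arr = np.array(xy)
# Row-wise unique: 'inverse' maps each original row to its row in the
# unique array, and 'counts' says how often each unique row occurs
_, inverse, counts = np.unique(arr, axis=0, return_inverse=True, return_counts=True)
keep = counts[inverse] == 1  # True where a row's value appears exactly once
result = arr[keep]           # boolean indexing preserves the original order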

EDIT:

I've received two answers that provide the correct result (and preserve order). RafaelC provided a Pandas option, and DYZ provided a Counter option. I am not that familiar with how to properly time things, but I ran both for 100 iterations (on the same data) with the following results (using time.time()):

Pandas: 13.02 sec

Counter: 28.15 sec

I am not sure why the Pandas implementation is faster; one difference is that the Pandas solution returns tuples (which is OK), so I tried the Counter solution without conversion back to lists and it was still 25 seconds.
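
For reference, a minimal timing harness with the standard library (a sketch; remove_dupes stands in for whichever candidate is being measured):

import timeit

# Time 100 runs of a candidate on the same data
elapsed = timeit.timeit(lambda: remove_dupes(xy), number=100)
print(f"{elapsed:.2f} sec for 100 iterations")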

  • Does order matter? Are you concerned about float values that are close but not exactly the same? Is numpy possible? Commented Sep 21, 2018 at 23:56
  • @dawg Ordering does matter. I'm not sure what you mean by "concerned about float values," but I can say that duplicates will be exact to the highest-precision digit. Commented Sep 22, 2018 at 0:00
  • @RafaelC If you could point me to one such duplicate, I'd appreciate it! Like I mentioned, I read through many posts, but I certainly missed something... the example you provided does not apply to my case. I do not want a list of unique lists; I want to remove all duplicate pairs, triples, quadruples, etc. Commented Sep 22, 2018 at 0:01
  • @Jon Hm, I see your point now. So, just to be clear, [1, 2] is different from [2, 1], right? Commented Sep 22, 2018 at 0:06
  • @Jon Regarding "concerned about float values": the issue is that at times floats do not compare exactly. Are .10 and .1000000000000000001 the same number or not? You can get differences from rounding and other artifacts of the limited precision with which floats are represented. That is why numpy.isclose exists. Commented Sep 22, 2018 at 1:32

4 Answers


I'd use pandas

s = pd.Series(list(map(tuple, l)))
s[~s.duplicated(keep=False)].tolist()

Takes

211 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

for 100000 entries, so a 10x speedup.
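
Applied to the toy example from the question, with the surviving tuples converted back to lists:

import pandas as pd

a = [[1, 2], [3, 4], [5, 6], [1, 2], [1, 2], [8, 9], [3, 4]]
s = pd.Series(list(map(tuple, a)))
b = [list(t) for t in s[~s.duplicated(keep=False)]]
print(b)  # [[5, 6], [8, 9]]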


Comments

The Counter is 32X faster... Pandas are slow beasts.
Very nice, this is exactly why I asked. Fast, preserves ordering. I don't believe ordering is preserved with the Counter solution provided by @DYZ. (It is not preserved in my tests, while this method certainly preserves it.) I will do some more testing to make sure this is what I need before accepting.
@DYZ Sorry, but not necessarily. How did you time this? It depends a lot on how many duplicates you have, the length of the list, etc. ;)
@Jon It's not about what you believe, it's about what it is. A list comprehension always preserves the order.
@DYZ see my edit for time comparisons...I find pandas to be 2x faster in my tests!

Consider using a Counter:

from collections import Counter

First, convert your lists into tuples, because tuples are immutable and therefore hashable (lists are not). Then count the tuples and select only those that occur once, collecting them in a set of non-duplicates:

nodups = {k for k,cnt in Counter(map(tuple, a)).items() if cnt == 1}

Now, since the order is important, filter the original list against non-dups:

[list(k) for k in map(tuple, a) if k in nodups]
#[[5, 6], [8, 9]]
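
Put together as the remove_dupes function from the question (a sketch that materializes the tuples once instead of calling map(tuple, ...) twice):

from collections import Counter

def remove_dupes(xy):
    tuples = [tuple(p) for p in xy]  # hashable versions of the pairs
    nodups = {k for k, cnt in Counter(tuples).items() if cnt == 1}
    return [list(k) for k in tuples if k in nodups]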

Comments

Bravo, how did I miss map here?
Does this preserve ordering? Testing against my method suggests that it does not, but I may have made a mistake. It does result in the same length of output, so that's good...
I tested against your data, and it does preserve the order (perhaps you saw an earlier version of the answer).
I can confirm seeing the correct output for the data set you provided in the question.
@RafaelC Ok, for an array of 10,000 entries, Pandas is 20% faster. You win.

In Python 3.6+ dictionaries keep their insertion order, so DYZ's Counter solution can be much improved by relying on that:

[list(k) for k, c in Counter(map(tuple, a)).items() if c == 1]

On my computer it is faster than the pandas solution.
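
Because Counter is a dict subclass, its keys come back in first-insertion order, so the count-1 items fall out already in their original order:

from collections import Counter

a = [[1, 2], [3, 4], [5, 6], [1, 2], [1, 2], [8, 9], [3, 4]]
print([list(k) for k, c in Counter(map(tuple, a)).items() if c == 1])
# [[5, 6], [8, 9]]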

RafaelC's pandas solution can also be sped up a great deal. The key is to switch from Series to DataFrame:

df = pd.DataFrame(a)
df[~df.duplicated(keep=False)].values.tolist()

On my computer it is nearly twice as fast as the original pandas solution. The key to the speedup is that it avoids doing prep work outside of pandas (list(map(tuple, l))).



Here is a solution that is efficient and uses only built-in modules:

import itertools
xy = [[104.44464000013596, 21.900339999891116],
 [9.574480000151937, 0.32839999976022227],
 [9.932610000251373, 0.19092000005798582],
 [9.821009999711748, 0.26556000039374794],
 [9.877130000349268, -0.6701499997226392],
 [149.51198999973872, -28.469329999879562],
 [149.35872999988965, -28.684280000021943],
 [9.859010000211413, -0.03293000041912819],
 [9.38918000035676, -0.9979400000309511],
 [77.35380000007001, 32.926530000359264]]

xy.sort()  # sort so that equal pairs become adjacent
sorted_data = [k for k, _ in itertools.groupby(xy)]  # keep one pair per group of equal pairs
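
Note that on the toy example from the question, sort-plus-groupby keeps one copy of each duplicated pair rather than removing all copies:

import itertools

a = [[1, 2], [3, 4], [5, 6], [1, 2], [1, 2], [8, 9], [3, 4]]
a.sort()
print([k for k, _ in itertools.groupby(a)])
# [[1, 2], [3, 4], [5, 6], [8, 9]], not [[5, 6], [8, 9]]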

Note: I have tested two methods, using numpy and itertools. numpy took 13 sec on data of length 10,000,000 and itertools took 1 sec on data of the same length.

Comments

I am assuming that the initial sort does not preserve order, which is important...
This does not produce the correct output for the test data.
The limitation of my approach is that you will lose the order of the data, so I will ask you to use it carefully. PS: my mistake, I forgot to mention that.
