
I have a list of xy coordinates as lists:

print(xy[0:10])

[[104.44464000013596, 21.900339999891116],
 [9.574480000151937, 0.32839999976022227],
 [9.932610000251373, 0.19092000005798582],
 [9.821009999711748, 0.26556000039374794],
 [9.877130000349268, -0.6701499997226392],
 [149.51198999973872, -28.469329999879562],
 [149.35872999988965, -28.684280000021943],
 [9.859010000211413, -0.03293000041912819],
 [9.38918000035676, -0.9979400000309511],
 [77.35380000007001, 32.926530000359264]]

Shown here are the first 10, but I have ~100,000 coordinate pairs in my list.

I would like to remove every list that appears more than once in this list (all copies of it, not just the extras), and to do so efficiently. As an easier-to-comprehend example, I would like a function remove_dupes that produces the following result:

a = [[1, 2], [3, 4], [5, 6], [1, 2], [1, 2], [8, 9], [3, 4]]
b = remove_dupes(a)
print(b)
# [[5, 6], [8, 9]]

Please note that order is important to preserve.

However, because I have such a large list, I find using the .count() method and iterating through the list to be too time-consuming. I also tried various tricks with set() and numpy's unique function.

Here is the fastest version I could come up with:

xy = [[x1, y1], [x2, y2], ..., [xn, yn]]

from collections import Counter
import numpy as np

def remove_dupes(xy):

    xy = [tuple(p) for p in xy]  # Convert each coordinate pair to a tuple so it is hashable

    p_hash = [hash(p) for p in xy]  # Hash each coordinate pair to a single value

    counts = Counter(p_hash)  # Count occurrences of each unique hash value

    p_remove = [key for (key, value) in counts.items() if value > 1]  # Keep the hashes with count > 1

    p_hash = np.array(p_hash)  # Cast from list to numpy array

    remove = np.zeros(len(xy), dtype=bool)  # Initialize the removal mask

    for p in p_remove:  # Mark every index where a non-unique hash appears // Most time-consuming portion
        remove[np.where(p == p_hash)[0]] = True

    xy = np.array(xy)  # Cast back to numpy array for indexing

    xy = xy[~remove, :]  # Keep only the non-duplicates

    return xy

This takes ~2 seconds for ~100,000 values (and longer if there are more duplicate pairs, triples, etc.). What bothers me is that functions like numpy.unique() return counts and indices in fractions of a second, yet I can't figure out how to conform their outputs to solve this problem. I looked through a couple dozen similar Stack Exchange posts, but found nothing that was both efficient and elegant. Does anyone have suggestions for a much more elegant way of solving this than I've presented here?
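
For concreteness, here is the kind of conformation I mean, as an untimed sketch (assuming NumPy 1.13+ for the axis=0 row-wise mode): map the per-row counts back through the inverse indices and keep the rows whose value occurs exactly once.

import numpy as np

arr = np.array(xy)
# Row-wise unique: 'inverse' maps each original row to its row in the
# unique array, and 'counts' says how often each unique row occurs
_, inverse, counts = np.unique(arr, axis=0, return_inverse=True, return_counts=True)
keep = counts[inverse] == 1  # True where a row's value appears exactly once
result = arr[keep]           # boolean indexing preserves the original order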

EDIT:

I've received two answers that provide the correct result (and preserve order). RafaelC provided a Pandas option, and DYZ provided a Counter option. I am not that familiar with how to properly time things, but I ran both for 100 iterations (on the same data) with the following results (using time.time()):

Pandas: 13.02 sec

Counter: 28.15 sec

I am not sure why the Pandas implementation is faster; one difference is that the Pandas solution returns tuples (which is OK), so I tried the Counter solution without conversion back to lists and it was still 25 seconds.
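
For reference, a minimal timing harness with the standard library (a sketch; remove_dupes stands in for whichever candidate is being measured):

import timeit

# Time 100 runs of a candidate on the same data
elapsed = timeit.timeit(lambda: remove_dupes(xy), number=100)
print(f"{elapsed:.2f} sec for 100 iterations")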

  • Does order matter? Are you concerned about float values that are close but not exactly the same? Is numpy possible? Commented Sep 21, 2018 at 23:56
  • @dawg Ordering does matter. I'm not sure what you mean by "concerned about float values," but I can say that duplicates will be exact to the highest-precision digit. Commented Sep 22, 2018 at 0:00
  • @RafaelC If you could point me to one such duplicate, I'd appreciate it! Like I mentioned, I read through many posts, but I certainly missed something... the example you provided does not apply to my case. I do not want a list of unique lists; I want to remove all duplicate pairs, triples, quadruples, etc. Commented Sep 22, 2018 at 0:01
  • @Jon Hm, I see your point now. So, just to be clear, [1, 2] is different from [2, 1], right? Commented Sep 22, 2018 at 0:06
  • @Jon Regarding "concerned about float values": the issue is that at times floats do not compare exactly. Are .10 and .1000000000000000001 the same number or not? You can get differences from rounding and other artifacts of the limited precision with which floats are represented. That is why numpy.isclose exists. Commented Sep 22, 2018 at 1:32

4 Answers


I'd use pandas

s = pd.Series(list(map(tuple, l)))
s[~s.duplicated(keep=False)].tolist()

Takes

211 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

for 100000 entries, so a 10x speedup.
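
Applied to the toy example from the question, with the surviving tuples converted back to lists:

import pandas as pd

a = [[1, 2], [3, 4], [5, 6], [1, 2], [1, 2], [8, 9], [3, 4]]
s = pd.Series(list(map(tuple, a)))
b = [list(t) for t in s[~s.duplicated(keep=False)]]
print(b)  # [[5, 6], [8, 9]]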


Comments

The Counter is 32X faster... Pandas are slow beasts.
Very nice, this is exactly why I asked. Fast, preserves ordering. I don't believe ordering is preserved with the Counter solution provided by @DYZ. (It is not preserved in my tests, while this method certainly preserves it.) I will do some more testing to make sure this is what I need before accepting.
@DYZ Sorry, but not necessarily. How did you time this? It depends a lot on how many duplicates you have, the length of the list, etc. ;)
@Jon It's not about what you believe, it's about what it is. A list comprehension always preserves the order.
@DYZ see my edit for time comparisons...I find pandas to be 2x faster in my tests!

Consider using a Counter:

from collections import Counter

First, convert your lists into tuples, because tuples are immutable and therefore hashable (lists are not). Then count the tuples and select only those that occur once, collecting them in a set of non-duplicates:

nodups = {k for k,cnt in Counter(map(tuple, a)).items() if cnt == 1}

Now, since the order is important, filter the original list against non-dups:

[list(k) for k in map(tuple, a) if k in nodups]
#[[5, 6], [8, 9]]
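
Put together as the remove_dupes function from the question (a sketch that materializes the tuples once instead of calling map(tuple, ...) twice):

from collections import Counter

def remove_dupes(xy):
    tuples = [tuple(p) for p in xy]  # hashable versions of the pairs
    nodups = {k for k, cnt in Counter(tuples).items() if cnt == 1}
    return [list(k) for k in tuples if k in nodups]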

Comments

Bravo, how did I miss map here?
Does this preserve ordering? Testing against my method suggests that it does not, but I may have made a mistake. It does result in the same length of output, so that's good...
I tested against your data, and it does preserve the order (perhaps you saw an earlier version of the answer).
I can confirm seeing the correct output for the data set you provided in the question.
@RafaelC Ok, for an array of 10,000 entries, Pandas is 20% faster. You win.

In Python 3.6+ dictionaries keep their insertion order, so DYZ's Counter solution can be much improved by relying on that:

[list(k) for k, c in Counter(map(tuple, a)).items() if c == 1]

On my computer it is faster than the pandas solution.
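
Because Counter is a dict subclass, its keys come back in first-insertion order, so the count-1 items fall out already in their original order:

from collections import Counter

a = [[1, 2], [3, 4], [5, 6], [1, 2], [1, 2], [8, 9], [3, 4]]
print([list(k) for k, c in Counter(map(tuple, a)).items() if c == 1])
# [[5, 6], [8, 9]]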

RafaelC's pandas solution can also be sped up a great deal. The key is to switch from Series to DataFrame:

df = pd.DataFrame(a)
df[~df.duplicated(keep=False)].values.tolist()

On my computer it is nearly twice as fast as the original pandas solution. The key to the speedup is that it avoids doing prep work outside of pandas (list(map(tuple, l))).



Here is a solution that is efficient and uses only built-in modules:

import itertools
xy = [[104.44464000013596, 21.900339999891116],
 [9.574480000151937, 0.32839999976022227],
 [9.932610000251373, 0.19092000005798582],
 [9.821009999711748, 0.26556000039374794],
 [9.877130000349268, -0.6701499997226392],
 [149.51198999973872, -28.469329999879562],
 [149.35872999988965, -28.684280000021943],
 [9.859010000211413, -0.03293000041912819],
 [9.38918000035676, -0.9979400000309511],
 [77.35380000007001, 32.926530000359264]]

xy.sort()  # sort so that equal pairs become adjacent
sorted_data = [k for k, _ in itertools.groupby(xy)]  # keep one pair per group of equal pairs
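
Note that on the toy example from the question, sort-plus-groupby keeps one copy of each duplicated pair rather than removing all copies:

import itertools

a = [[1, 2], [3, 4], [5, 6], [1, 2], [1, 2], [8, 9], [3, 4]]
a.sort()
print([k for k, _ in itertools.groupby(a)])
# [[1, 2], [3, 4], [5, 6], [8, 9]], not [[5, 6], [8, 9]]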

Note: I have tested two methods, using numpy and itertools. numpy took 13 sec on data of length 10,000,000 and itertools took 1 sec on data of the same length.

Comments

I am assuming that the initial sort does not preserve order, which is important...
This does not produce the correct output for the test data.
The limitation of my approach is that you will lose the order of the data, so I will ask you to use it carefully. PS: my mistake, I forgot to mention that.
