I build large lists of high-level objects while parsing a tree. Afterwards, I have to remove duplicates from the list, and I found this step very slow in Python 2 (it was acceptable, but still a little slow, in Python 3). However, I know that distinct objects have distinct ids. Using that fact, I managed to get much faster code by following these steps:
- append all objects to a list while parsing;
- sort the list with the `key=id` option;
- iterate over the sorted list and remove an element if the previous one has the same id.
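The steps above can be sketched as follows (the helper name `unique_by_id` is mine, not from the question):

```python
def unique_by_id(objects):
    """Remove duplicates by identity: sort by id, then skip repeats.

    Order of the input list is not preserved.
    """
    result = []
    previous_id = None
    for obj in sorted(objects, key=id):
        if id(obj) != previous_id:  # previous element was a different object
            result.append(obj)
            previous_id = id(obj)
    return result
```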
Thus I have a working code which now runs smoothly, but I wonder whether I can achieve this task more directly in Python.
Example. Let's build two identical objects having the same value but a different id (for instance I will take a fractions.Fraction in order to rely on the standard library):
```python
from fractions import Fraction
a = Fraction(1, 3)
b = Fraction(1, 3)
```
Now, if I try to achieve what I want with the pythonic `list(set(...))`, I get the wrong result: `{a, b}` keeps only one of the two objects (they are equal in value but have different ids).
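To make that behaviour concrete:

```python
from fractions import Fraction

a = Fraction(1, 3)
b = Fraction(1, 3)

print(a is b)       # False: two distinct objects with different ids...
print(len({a, b}))  # 1: ...but the set keeps only one, because a == b
```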
My question now is: what is the most pythonic, reliable, short and fast way to remove duplicates by id rather than by value? The order of the list doesn't matter if it has to change.
`Fraction` is a bad example, because if you're trying to do this with a standard-library class, you can't. You can't change the rules `set` uses; you'd have to build a set-like class of your own (perhaps a thin wrapper around a dictionary keyed on id). However, if you're actually using a class you control, you can implement `__eq__` as shown below.
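A minimal sketch of the dictionary-keyed-on-id idea mentioned above (the helper name `dedupe_by_id` is mine): a dict keeps exactly one value per key, so keying on `id()` deduplicates by identity in a single pass, with no sort needed.

```python
def dedupe_by_id(objects):
    """Keep one object per id by using id() as a dict key."""
    return list({id(obj): obj for obj in objects}.values())
```

This also preserves first-seen order in Python 3.7+, where dicts maintain insertion order.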