I build large lists of high-level objects while parsing a tree. Afterwards, I have to remove duplicates from the list, and I found this step very slow in Python 2 (it was acceptable, but still a little slow, in Python 3). However, I know that distinct objects have distinct ids. Using that fact, I managed to get much faster code by following these steps:
- append all objects to a list while parsing;
- sort the list with the `key=id` option;
- iterate over the sorted list and remove an element if the previous one has the same id.
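The steps above can be sketched as follows (the helper name `unique_by_id` is mine, not from the question):

```python
def unique_by_id(objects):
    """Remove duplicates by identity: sort by id, then skip repeats.

    Order of the input list is not preserved.
    """
    result = []
    previous_id = None
    for obj in sorted(objects, key=id):
        if id(obj) != previous_id:  # previous element was a different object
            result.append(obj)
            previous_id = id(obj)
    return result
```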
Thus I have a working code which now runs smoothly, but I wonder whether I can achieve this task more directly in Python.
Example. Let's build two identical objects having the same value but a different id (for instance I will take a fractions.Fraction in order to rely on the standard library):
```python
from fractions import Fraction
a = Fraction(1, 3)
b = Fraction(1, 3)
```
Now, if I try to achieve what I want with the pythonic `list(set(...))`, I get the wrong result: `{a, b}` keeps only one of the two objects (they are equal in value but have different ids).
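To make that behaviour concrete:

```python
from fractions import Fraction

a = Fraction(1, 3)
b = Fraction(1, 3)

print(a is b)       # False: two distinct objects with different ids...
print(len({a, b}))  # 1: ...but the set keeps only one, because a == b
```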
My question now is: what is the most pythonic, reliable, short and fast way to remove duplicates by id rather than by value? The order of the list doesn't matter if it has to change.
`Fraction` is a bad example, because if you're trying to do this with a standard-library class, you can't. You can't change the rules `set` uses; you'd have to build a set-like class of your own (perhaps a thin wrapper around a dictionary keyed on id). However, if you're actually using a class you control, you can implement `__eq__` as shown below.
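A minimal sketch of the dictionary-keyed-on-id idea mentioned above (the helper name `dedupe_by_id` is mine): a dict keeps exactly one value per key, so keying on `id()` deduplicates by identity in a single pass, with no sort needed.

```python
def dedupe_by_id(objects):
    """Keep one object per id by using id() as a dict key."""
    return list({id(obj): obj for obj in objects}.values())
```

This also preserves first-seen order in Python 3.7+, where dicts maintain insertion order.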