51

I've got a list of objects and I've got a db table full of records. My list of objects has a title attribute and I want to remove any objects with duplicate titles from the list (leaving the original).

Then I want to check if my list of objects has any duplicates of any records in the database and if so, remove those items from list before adding them to the database.

I have seen solutions for removing duplicates from a list like this: myList = list(set(myList)), but i'm not sure how to do that with a list of objects?

I need to maintain the order of my list of objects too. I was also thinking maybe I could use difflib to check for differences in the titles.

3
  • leaving the original , what this mean ? because if like you said you want to maintain order of the list so the first occurrence of a duplicate object in the list will be the original right ? Commented Nov 12, 2010 at 21:56
  • yes, I just meant I want to remove all of the duplicates except for the original. @S.Lott, I did search a ton and I didn't find anything, that's why I came here. Can you cite an example that address this exact problem? I would be happy to see it. Commented Nov 12, 2010 at 22:25
  • stackoverflow.com/…. Commented Nov 15, 2010 at 16:51

8 Answers 8

73

The set(list_of_objects) will only remove the duplicates if you know what a duplicate is, that is, you'll need to define a uniqueness of an object.

In order to do that, you'll need to make the object hashable. You need to define both __hash__ and __eq__ method, here is how:

http://docs.python.org/glossary.html#term-hashable

Though, you'll probably only need to define __eq__ method.

EDIT: How to implement the __eq__ method:

You'll need to know, as I mentioned, the uniqueness definition of your object. Supposed we have a Book with attributes author_name and title that their combination is unique, (so, we can have many books Stephen King authored, and many books named The Shining, but only one book named The Shining by Stephen King), then the implementation is as follows:

def __eq__(self, other):
    return self.author_name==other.author_name\
           and self.title==other.title

Similarly, this is how I sometimes implement the __hash__ method:

def __hash__(self):
    return hash(('title', self.title,
                 'author_name', self.author_name))

You can check that if you create a list of 2 books with same author and title, the book objects will be the same (with is operator) and equal (with == operator). Also, when set() is used, it will remove one book.

EDIT: This is one old anwser of mine, but I only now notice that it has the error which is corrected with strikethrough in the last paragraph: objects with the same hash() won't give True when compared with is. Hashability of object is used, however, if you intend to use them as elements of set, or as keys in dictionary.

Sign up to request clarification or add additional context in comments.

5 Comments

Nice, I didn't know about __hash__ and __eq__. Any examples on how to implement __eq__?
you need to make sure the class is same or the field wont be available so eq also needs to do self.__class__ == other.__class__ and self.author_name==other.author_name\ and self.title==other.title
Do we know which one of the "duplicates" is kept and which one is discarded? Following the book example, let's say they have a field publication_date (the same book can have multiple editions, hence multiple publication dates). If the list is initially ordered by most recent to oldest, and I remove duplicates using this technique (disregarding the publication_date when defining __eq__), do I know which one is kept and which one is discarded?
The OP requested order be preserved, but set does not maintain order.
This answer is not complete as it doesn't have a clue how to use these 2 methods.
28

Since they're not hashable, you can't use a set directly. The titles should be though.

Here's the first part.

seen_titles = set()
new_list = []
for obj in myList:
    if obj.title not in seen_titles:
        new_list.append(obj)
        seen_titles.add(obj.title)

You're going to need to describe what database/ORM etc. you're using for the second part though.

4 Comments

I'm using mysql with sqlobject.
@bababa please update the question so that other people see it as well.
@bababa, I don't see a good way to do this using sqlobject (i.e. without pulling every object from the DB in one query or making one query per object) so I'll wait a while and then post that if somebody that doesn't know sqlobject better than I do doesn't come along.
Just out of curiousity, why did you use a set instead of a dict? isn't dict key checking O(1) as well?
9

If you can't (or won't) define __eq__ for the objects, you can use a dict-comprehension to achieve the same end:

unique = list({item.attribute: item for item in mylist}.values())

Note that this will contain the last instance of a given key, e.g. for mylist = [Item(attribute=1, tag='first'), Item(attribute=1, tag='second'), Item(attribute=2, tag='third')] you get [Item(attribute=1, tag='second'), Item(attribute=2, tag='third')]. You can get around this by using mylist[::-1] (if the full list is present).

Comments

6

This seems pretty minimal:

new_dict = dict()
for obj in myList:
    if obj.title not in new_dict:
        new_dict[obj.title] = obj

Comments

3

For non-hashable types you can use a dictionary comprehension to remove duplicate objects based on a field in all objects. This is particularly useful for Pydantic which doesn't support hashable types by default:

{ row.title : row for row in rows }.values()

Note that this will consider duplicates solely based on on row.title, and will take the last matched object for row.title. This means if your rows may have the same title but different values in other attributes, then this won't work.

e.g. [{"title": "test", "myval": 1}, {"title": "test", "myval": 2}] ==> [{"title": "test", "myval": 2}]

If you wanted to match against multiple fields in row, you could extend this further:

{ f"{row.title}\0{row.value}" : row for row in rows }.values()

The null character \0 is used as a separator between fields. This assumes that the null character isn't used in either row.title or row.value.

3 Comments

I just noticed that this is pretty much the same answer as @Dave but adds a bit more detail, my apologies for the duplicate answer!
This is actually a good idiom (if a bit kludgy).
For the multi-key case, just using a tuple {(row.title, row.value):row for row in rows}.values() is better -- don't have to assume the fields are strings or that the null character doesn't appear in them
0

Both __hash__ and __eq__ are needed for this.

__hash__ is needed to add an object to a set, since python's sets are implemented as hashtables. By default, immutable objects like numbers, strings, and tuples are hashable.

However, hash collisions (two distinct objects hashing to the same value) are inevitable, due to the pigeonhole principle. So, two objects cannot be distinguished only using their hash, and the user must specify their own __eq__ function. Thus, the actual hash function the user provides is not crucial, though it is best to try to avoid hash collisions for performance (see What's a correct and good way to implement __hash__()?).

Comments

0

I recently ended up using the code below. It is similar to other answers as it iterates over the list and records what it is seeing and then removes any item that it has already seen but it doesn't create a duplicate list, instead it just deletes the item from original list.

seen = {}
for obj in objList:
    if obj["key-property"] in seen.keys():
        objList.remove(obj)
    else:
        seen[obj["key-property"]] = 1

1 Comment

This only works if the objList contains objects that are comparable (i.e. implementing the eq method). For more information see stackoverflow.com/a/11456817/290588 Creating a deduplicated list would work for objects that do not implement eq.
-15

Its quite easy freinds :-

a = [5,6,7,32,32,32,32,32,32,32,32]

a = list(set(a))

print (a)

[5,6,7,32]

thats it ! :)

1 Comment

Cannot do this on a list that contains objects.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.