1

I have a model

class Foo(models.Model): 
     first = models.CharField()
     second = models.CharField()

data I have is

first second
1     2
1     2
1     2
3     4

Now I want to delete all duplicate rows and keep one entry. The end result

first second
1     2
3     4

How do I do this? I checked this question but could not figure it out properly. Annotate

I have tried

foo_ids = Foo.objects.annotate(first_c=Count('first'), second_c=Count('second')).filter(first_c__gt=1, second_c__gt=1).values('first', 'second', 'id')

Then try and figure out how to not delete one of each I list of duplicates.

1

2 Answers 2

1

I ended up doing this.

from django.db.models import Count
duplicate_foo = Foo.objects.values('req_group','child_program__id').annotate(id_c=Count('id')).filter(id_c__gt=1)
for dups in duplicate_foo:
    for i, val in enumerate(Foo.objects.filter(first=dups['first'],                                                      
                                               second=dups['second'])):
        if i ==0:
            continue
        val.delete()

Not the most optimzed solution. But it works.

Sign up to request clarification or add additional context in comments.

2 Comments

Speed depends on how many duplicates you have. If most of your records are not duplicated, this is faster than Exprator's answer.
Calling delete() on every single duplicate still triggers one DB query for each duplicate. That's a massive overhead if you deal with thousands of duplicates.
0

It's an older thread, but both answers don't fit the bill for large datasets and lead to a huge number of queries.

You can use this generic method instead:

from django.apps import apps
from django.db.models import Count, Max


def get_duplicates_from_fields(app, model, fields, delete=False, **filters):
    """
    Returns duplicate records based on a list of fields,
    optionally deletes duplicate records.
    """
    Model = apps.get_model(app, model)
    duplicates = (Model.objects.filter(**filters).values(*fields)
                  .order_by()
                  .annotate(_max_id=Max('id'), _count=Count('id'))
                  .filter(_count__gt=1))

    for duplicate in duplicates:
        if delete:
            (
                Model.objects
                .filter(**filters)
                .filter(**{x: duplicate[x] for x in fields})
                .exclude(id=duplicate['_max_id'])
                .delete()
            )
        else:
            print(duplicate)

You can use this method as such:

get_duplicates_from_fields('myapp', 'Foo', ['first', 'second'], True)

This lets you find and delete duplicate records based on any number of fields.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.