Django select only rows with duplicate field values

Question

suppose we have a model in django defined as follows:

class Literal:
    name = models.CharField(...)
    ...

Name field is not unique, and thus can have duplicate values. I need to accomplish the following task: Select all rows from the model that have at least one duplicate value of the name field.

I know how to do it using plain SQL (may be not the best solution):

select * from literal where name IN (
    select name from literal group by name having count((name)) > 1
);

So, is it possible to select this using django ORM? Or better SQL solution?

John Mee · Accepted Answer · 2017-01-22 22:27:26Z

292

Try:

from django.db.models import Count
Literal.objects.values('name')
               .annotate(Count('id')) 
               .order_by()
               .filter(id__count__gt=1)

This is as close as you can get with Django. The problem is that this will return a ValuesQuerySet with only name and count. However, you can then use this to construct a regular QuerySet by feeding it back into another query:

dupes = Literal.objects.values('name')
                       .annotate(Count('id'))
                       .order_by()
                       .filter(id__count__gt=1)
Literal.objects.filter(name__in=[item['name'] for item in dupes])

edited Jan 22, 2017 at 22:27

John Mee

52.6k39 gold badges158 silver badges196 bronze badges

answered Jan 24, 2012 at 15:24

Chris Pratt

240k37 gold badges414 silver badges469 bronze badges

Sign up to request clarification or add additional context in comments.

12 Comments

dragoon Over a year ago

Probably you have meant Literal.objects.values('name').annotate(name_count=Count('name')).filter(name_count__gt=1)?

dragoon Over a year ago

Original query gives Cannot resolve keyword 'id_count' into field

dragoon Over a year ago

Thanks for updated answer, I think I will stick with this solution, you can even do it without list comprehension by using values_list('name', flat=True)

Chris Pratt Over a year ago

Django previously had a bug on this (might have been fixed in recent versions) where if you don't specify a fieldname for the Count annotation to saved as, it defaults to [field]__count. However, that double-underscore syntax is also how Django interprets you wanting to do a join. So, essentially when you try to filter on that, Django thinks you're trying to do a join with count which obviously doesn't exist. The fix is to specify a name for your annotation result, i.e. annotate(mycount=Count('id')) and then filter on mycount instead.

Piper Merriam Over a year ago

if you add another call to values('name') after your call to annotate, you can remove the list comprehension and say Literal.objects.filter(name__in=dupes) which will allow this to all be executed in a single query.

|

Daniel Holmes · Accepted Answer · 2019-07-01 12:44:55Z

70

This was rejected as an edit. So here it is as a better answer

dups = (
    Literal.objects.values('name')
    .annotate(count=Count('id'))
    .values('name')
    .order_by()
    .filter(count__gt=1)
)

This will return a ValuesQuerySet with all of the duplicate names. However, you can then use this to construct a regular QuerySet by feeding it back into another query. The django ORM is smart enough to combine these into a single query:

Literal.objects.filter(name__in=dups)

The extra call to .values('name') after the annotate call looks a little strange. Without this, the subquery fails. The extra values tricks the ORM into only selecting the name column for the subquery.

edited Jul 1, 2019 at 12:44

Daniel Holmes

2,0222 gold badges19 silver badges30 bronze badges

answered Apr 13, 2013 at 14:21

Piper Merriam

2,9942 gold badges26 silver badges31 bronze badges

3 Comments

gdvalderrama Over a year ago

Nice trick, unfortunately this will only work if just one value is used (eg. if both 'name' and 'phone' where used, the last part wouldn't work).

stefanfoulis Over a year ago

What is the .order_by() for?

Oli Over a year ago

@stefanfoulis It clears out any existing ordering. If you have a model-set ordering, this becomes part of the SQL GROUP BY clause, and that breaks things. Found that out when playing with Subquery (in which you do very similar grouping via .values())

JamesO · Accepted Answer · 2012-01-24 15:26:17Z

12

try using aggregation

Literal.objects.values('name').annotate(name_count=Count('name')).exclude(name_count=1)

answered Jan 24, 2012 at 15:26

JamesO

26.2k4 gold badges44 silver badges42 bronze badges

2 Comments

dragoon Over a year ago

Ok, that gives the corrent list of names, but is it possible to selects ids and other fields at the same time?

JamesO Over a year ago

@dragoon - no but Chris Pratt has covered the alternative in his answer.

SaeX · Accepted Answer · 2024-10-17 07:16:29Z

9

In case you use PostgreSQL, you can do something like this:

from django.contrib.postgres.aggregates import ArrayAgg
from django.db.models import Func, Value, DecimalField

duplicate_ids = (Literal.objects.values('name')
                 .annotate(ids=ArrayAgg('id'))
                 .annotate(c=Func('ids', Value(1), function='array_length', output_field=DecimalField()))
                 .filter(c__gt=1)
                 .annotate(ids=Func('ids', function='unnest'))
                 .values_list('ids', flat=True))

It results in this rather simple SQL query:

SELECT unnest(ARRAY_AGG("app_literal"."id")) AS "ids"
FROM "app_literal"
GROUP BY "app_literal"."name"
HAVING array_length(ARRAY_AGG("app_literal"."id"), 1) > 1

edited Oct 17, 2024 at 7:16

SaeX

19k18 gold badges81 silver badges104 bronze badges

answered Mar 9, 2017 at 7:22

Eugene Pakhomov

11.2k3 gold badges32 silver badges57 bronze badges

3 Comments

oglop Over a year ago

I tried this but python code gave me an error: FieldError: Expression contains mixed types: ArrayField, IntegerField. You must set output_field.. However, SQL query works as expected (Django 3.2)

a regular fellow Over a year ago

Works great (Django 2.2). Also, you don't need the array_length annotation, and can instead filter by ids__len - docs.djangoproject.com/en/dev/ref/contrib/postgres/fields/#len

SaeX Over a year ago

@oglop with output_field=DecimalField() in Func(), it resolves the issue. Have edited the answer as such.

Özer · Accepted Answer · 2021-03-20 15:36:13Z

2

Ok, so for some reason none of the above worked for, it always returned <MultilingualQuerySet []>. I use the following, much easier to understand but not so elegant solution:

dupes = []
uniques = []

dupes_query = MyModel.objects.values_list('field', flat=True)

for dupe in set(dupes_query):
    if not dupe in uniques:
        uniques.append(dupe)
    else:
        dupes.append(dupe)

print(set(dupes))

edited Mar 20, 2021 at 15:36

answered Mar 20, 2021 at 15:18

Özer

2,11420 silver badges23 bronze badges

Comments

arulmr · Accepted Answer · 2014-01-22 07:47:33Z

0

If you want to result only names list but not objects, you can use the following query

repeated_names = Literal.objects.values('name').annotate(Count('id')).order_by().filter(id__count__gt=1).values_list('name', flat='true')

edited Jan 22, 2014 at 7:47

arulmr

8,8629 gold badges57 silver badges70 bronze badges

answered Dec 30, 2013 at 4:52

user2959723

6812 gold badges7 silver badges13 bronze badges

Collectives™ on Stack Overflow

Django select only rows with duplicate field values

6 Answers 6

12 Comments

3 Comments

2 Comments

3 Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

6 Answers 6

12 Comments

3 Comments

2 Comments

3 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related