146

Suppose we have a model in Django defined as follows:

class Literal(models.Model):
    name = models.CharField(...)
    ...

The name field is not unique, and thus can have duplicate values. I need to accomplish the following task: select all rows from the model that have at least one duplicate value of the name field.

I know how to do it using plain SQL (maybe not the best solution):

select * from literal where name IN (
    select name from literal group by name having count(name) > 1
);

So, is it possible to select this using the Django ORM? Or is there a better SQL solution?
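As a quick sanity check, the SQL from the question can be run against a throwaway SQLite database using only the standard library. This is a minimal sketch: the table layout (an `id` primary key plus `name`) and the sample rows are assumptions made for illustration, not the real schema.

```python
import sqlite3

# Hypothetical in-memory table standing in for the Literal model's table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE literal (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO literal (name) VALUES (?)",
    [("a",), ("b",), ("a",), ("c",), ("b",)],
)

# The query from the question: keep every row whose name occurs more than once.
rows = conn.execute(
    """
    SELECT id, name FROM literal
    WHERE name IN (
        SELECT name FROM literal GROUP BY name HAVING COUNT(name) > 1
    )
    ORDER BY id
    """
).fetchall()

print(rows)
```

The row with the unique name "c" is the only one left out; every row sharing a name with another row is returned in full.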

6 Answers

292

Try:

from django.db.models import Count

# Parentheses are needed so the chained calls can span several lines.
(
    Literal.objects.values('name')
    .annotate(Count('id'))
    .order_by()
    .filter(id__count__gt=1)
)

This is as close as you can get with Django. The problem is that this returns a values() queryset (a ValuesQuerySet in older Django versions) containing only name and the count. However, you can then use it to construct a regular QuerySet by feeding it into another query:

dupes = (
    Literal.objects.values('name')
    .annotate(Count('id'))
    .order_by()
    .filter(id__count__gt=1)
)
Literal.objects.filter(name__in=[item['name'] for item in dupes])

12 Comments

Probably you have meant Literal.objects.values('name').annotate(name_count=Count('name')).filter(name_count__gt=1)?
Original query gives Cannot resolve keyword 'id_count' into field
Thanks for the updated answer. I think I will stick with this solution; you can even do it without the list comprehension by using values_list('name', flat=True).
Django previously had a bug here (it might have been fixed in recent versions): if you don't specify a name for the Count annotation to be saved as, it defaults to [field]__count. However, that double-underscore syntax is also how Django interprets a join. So when you try to filter on it, Django thinks you're trying to join through a field called count, which obviously doesn't exist. The fix is to specify a name for your annotation result, i.e. annotate(mycount=Count('id')), and then filter on mycount instead.
If you add another call to values('name') after your call to annotate, you can remove the list comprehension and write Literal.objects.filter(name__in=dupes), which allows all of this to be executed in a single query.
70

This was rejected as an edit, so here it is as a better answer:

from django.db.models import Count

dups = (
    Literal.objects.values('name')
    .annotate(count=Count('id'))
    .values('name')
    .order_by()
    .filter(count__gt=1)
)

This returns a values() queryset (a ValuesQuerySet in older Django versions) with all of the duplicate names. However, you can then use it to construct a regular QuerySet by feeding it into another query. The Django ORM is smart enough to combine these into a single query:

Literal.objects.filter(name__in=dups)

The extra call to .values('name') after the annotate call looks a little strange, but without it the subquery fails. The extra values() call tricks the ORM into selecting only the name column for the subquery.

3 Comments

Nice trick. Unfortunately, this only works if just one value is used (e.g. if both 'name' and 'phone' were used, the last part wouldn't work).
What is the .order_by() for?
@stefanfoulis It clears out any existing ordering. If you have a model-set ordering, this becomes part of the SQL GROUP BY clause, and that breaks things. Found that out when playing with Subquery (in which you do very similar grouping via .values())
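The effect described in the comment above can be illustrated at the SQL level with the standard-library sqlite3 module. This is only a sketch: which column ends up in GROUP BY depends on the model's ordering, and using `id` here is an assumption chosen to make the effect obvious. The table and rows are hypothetical.

```python
import sqlite3

# Hypothetical in-memory table standing in for the Literal model's table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE literal (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO literal (name) VALUES (?)", [("a",), ("b",), ("a",)])

# Grouping by name alone: the two "a" rows form one group of size 2.
by_name = conn.execute(
    "SELECT name, COUNT(id) FROM literal GROUP BY name HAVING COUNT(id) > 1"
).fetchall()

# Grouping by name AND id, as if an ordering column had leaked into GROUP BY:
# every group now holds exactly one row, so HAVING filters everything out.
by_name_and_id = conn.execute(
    "SELECT name, COUNT(id) FROM literal GROUP BY name, id HAVING COUNT(id) > 1"
).fetchall()

print(by_name)
print(by_name_and_id)
```

With the extra grouping column, the duplicate check silently returns nothing, which is why an empty .order_by() is used to clear any ordering before grouping.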
12

Try using aggregation:

Literal.objects.values('name').annotate(name_count=Count('name')).exclude(name_count=1)

2 Comments

Ok, that gives the correct list of names, but is it possible to select ids and other fields at the same time?
@dragoon - no but Chris Pratt has covered the alternative in his answer.
9

In case you use PostgreSQL, you can do something like this:

from django.contrib.postgres.aggregates import ArrayAgg
from django.db.models import Func, Value, DecimalField

duplicate_ids = (Literal.objects.values('name')
                 .annotate(ids=ArrayAgg('id'))
                 .annotate(c=Func('ids', Value(1), function='array_length', output_field=DecimalField()))
                 .filter(c__gt=1)
                 .annotate(ids=Func('ids', function='unnest'))
                 .values_list('ids', flat=True))

It results in this rather simple SQL query:

SELECT unnest(ARRAY_AGG("app_literal"."id")) AS "ids"
FROM "app_literal"
GROUP BY "app_literal"."name"
HAVING array_length(ARRAY_AGG("app_literal"."id"), 1) > 1

3 Comments

I tried this, but the Python code gave me an error: FieldError: Expression contains mixed types: ArrayField, IntegerField. You must set output_field. However, the SQL query works as expected (Django 3.2).
Works great (Django 2.2). Also, you don't need the array_length annotation, and can instead filter by ids__len - docs.djangoproject.com/en/dev/ref/contrib/postgres/fields/#len
@oglop Adding output_field=DecimalField() to Func() resolves the issue. I have edited the answer accordingly.
2

Ok, so for some reason none of the above worked for me; it always returned <MultilingualQuerySet []>. I use the following solution, which is much easier to understand but not as elegant:

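dupes = []
uniques = []

dupes_query = MyModel.objects.values_list('field', flat=True)

# Iterate over the raw values, not set(dupes_query): a set would hide the
# repeats, so nothing would ever be detected as a duplicate.
for value in dupes_query:
    if value not in uniques:
        uniques.append(value)
    else:
        dupes.append(value)

print(set(dupes))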
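The same count-in-Python idea can be sketched more compactly with collections.Counter. The helper name `duplicated_values` and the sample list are hypothetical; in practice the input would be something like the values_list('field', flat=True) result. Note that this pulls every value into memory, so it is only reasonable for small tables.

```python
from collections import Counter


def duplicated_values(values):
    """Return the set of values that occur more than once in `values`."""
    counts = Counter(values)
    return {value for value, n in counts.items() if n > 1}


# Stand-in for list(MyModel.objects.values_list('field', flat=True)).
sample = ["a", "b", "a", "c", "b", "a"]
print(duplicated_values(sample))
```

Counter does the counting in a single pass, which avoids the repeated list membership checks of a manual loop.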


0

If you only want the list of names rather than the objects, you can use the following query:

repeated_names = Literal.objects.values('name').annotate(Count('id')).order_by().filter(id__count__gt=1).values_list('name', flat=True)

