Optimise Django subquery (currently using `annotate` and `Count`)

Question

Using: Django 1.11, Postgres 9.6

I need to optimise a Django ORM query for consumption by the Django Rest Framework. The query must return records that:

Match a list of IDs in their target field
Limit the result set to records where a source ID appears more than once in the set.

The current approach is to create a subquery using annotate and Count but the sheer amount of processing behind each request added to pagination means that the apps that it's causing timeouts or very slow behaviour.

If there is anything that can be done by Postgres on the server as a raw query I'm fine with that.

Model:

class Relationship(models.Model):
    id = models.AutoField(primary_key=True)
    source = models.BigIntegerField(db_index=True)
    target = models.BigIntegerField(db_index=True)

View snippet:

match_list = [123, 456, 789] # dummy data for example
queryset = Relationship.objects.filter(target__in=match_list)
sub_queryset = (Relationship.objects.filter(target__in=_match_list)
                                    .values('source')
                                    .annotate(source_count=Count("source"))
                                    .filter(source_count__gt=1)
                                    )
sub_ids = [i["source"] for i in sub_queryset]
queryset = (queryset.filter(source__in=sub_ids)
            )

The API takes a list of target IDs as an argument and responds with a list of all source IDs that are connected to that target. However, I'm filtering the queryset to only return source records that are connected to two or more targets.

As background, the resulting queryset will be served by Django Rest Framework and it's currently causing timeouts because the requests get exponentially longer the more

Note: I'm putting this on SO because it's causing my requests to timeout and therefore causing a fault. I know I could extend the timeout duration but would rather optimise the query. I considered CodeReview but felt this was more appropriate.

Edit 1: Following @albar's suggestion, it's currently a separate subquery as the annotate / Count operation only works if the values are returned, not full records

Thanks @e4c5, good point but already addressed - I have indexes on both fields (see model - "db_index=True") so I believe it's the annotate / Count aggregation step. — Phil Sheard
– Phil Sheard, Commented May 4, 2017 at 8:52
I really don't understand the logic of your query, and why those many steps. I think the query you want is simply: Relationship.objects.filter(target__in=_match_list).annotate(source_count=Count("source")).filter(source_count__gt=1). — albar
– albar, Commented May 4, 2017 at 9:05
@albar I tried that approach initially but the annotation is only applied across the single record, so the source_count value never gets above 1. I had to do the .values operation to enable the aggregation across the whole result set. I think the answer might lie in doing a subquery rather than multiple queries but I've not got much experience in that yet. — Phil Sheard
– Phil Sheard, Commented May 4, 2017 at 9:15

Matthew Schinckel · Accepted Answer · 2018-08-27 07:34:47Z

1

One possibly solution is to use a correlated subquery:

duplicates = Relationship.objects.exclude(
    id=OuterRef('id')
).filter(source=OuterRef('source'))

Relationship.objects.annotate(
    duplicated=Exists(duplicates)
).filter(
    duplicated=True
)

What this does is build up a queryset (that can't be evaluated on it's own) that will only contain elements that are duplicates of the main queryset's "source" value. It then filters the elements so only those are selected.

It's not currently possible to do this without a .annotate(...).filter(...) in django, however sometimes that has performance implications: being able to evaluate it only one in the database (in the WHERE clause) can make big improvements to performance.

answered Aug 27, 2018 at 7:34

Matthew Schinckel

35.7k6 gold badges91 silver badges122 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Phil Sheard Over a year ago

This is a nice idea, Matthew - thanks for posting it. It's been a while since I posted the query but I'm still using the codebase so I'll test it against the solution I ended up using.

Collectives™ on Stack Overflow

Optimise Django subquery (currently using `annotate` and `Count`)

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related