0

Using: Django 1.11, Postgres 9.6

I need to optimise a Django ORM query for consumption by the Django Rest Framework. The query must return records that:

  1. Match a list of IDs in their target field
  2. Limit the result set to records where a source ID appears more than once in the set.

The current approach is to create a subquery using annotate and Count but the sheer amount of processing behind each request added to pagination means that the apps that it's causing timeouts or very slow behaviour.

If there is anything that can be done by Postgres on the server as a raw query I'm fine with that.

Model:

class Relationship(models.Model):
    id = models.AutoField(primary_key=True)
    source = models.BigIntegerField(db_index=True)
    target = models.BigIntegerField(db_index=True)

View snippet:

match_list = [123, 456, 789] # dummy data for example
queryset = Relationship.objects.filter(target__in=match_list)
sub_queryset = (Relationship.objects.filter(target__in=_match_list)
                                    .values('source')
                                    .annotate(source_count=Count("source"))
                                    .filter(source_count__gt=1)
                                    )
sub_ids = [i["source"] for i in sub_queryset]
queryset = (queryset.filter(source__in=sub_ids)
            )

The API takes a list of target IDs as an argument and responds with a list of all source IDs that are connected to that target. However, I'm filtering the queryset to only return source records that are connected to two or more targets.

As background, the resulting queryset will be served by Django Rest Framework and it's currently causing timeouts because the requests get exponentially longer the more

Note: I'm putting this on SO because it's causing my requests to timeout and therefore causing a fault. I know I could extend the timeout duration but would rather optimise the query. I considered CodeReview but felt this was more appropriate.

Edit 1: Following @albar's suggestion, it's currently a separate subquery as the annotate / Count operation only works if the values are returned, not full records

6
  • Could be as simple as adding indexes on target and source. Commented May 4, 2017 at 8:34
  • Thanks @e4c5, good point but already addressed - I have indexes on both fields (see model - "db_index=True") so I believe it's the annotate / Count aggregation step. Commented May 4, 2017 at 8:52
  • I really don't understand the logic of your query, and why those many steps. I think the query you want is simply: Relationship.objects.filter(target__in=_match_list).annotate(source_count=Count("source")).filter(source_count__gt=1). Commented May 4, 2017 at 9:05
  • 1
    @albar I tried that approach initially but the annotation is only applied across the single record, so the source_count value never gets above 1. I had to do the .values operation to enable the aggregation across the whole result set. I think the answer might lie in doing a subquery rather than multiple queries but I've not got much experience in that yet. Commented May 4, 2017 at 9:15
  • @Phil Sheard OK, I understand now. Not easy... Commented May 4, 2017 at 9:23

1 Answer 1

1

One possibly solution is to use a correlated subquery:

duplicates = Relationship.objects.exclude(
    id=OuterRef('id')
).filter(source=OuterRef('source'))

Relationship.objects.annotate(
    duplicated=Exists(duplicates)
).filter(
    duplicated=True
)

What this does is build up a queryset (that can't be evaluated on it's own) that will only contain elements that are duplicates of the main queryset's "source" value. It then filters the elements so only those are selected.

It's not currently possible to do this without a .annotate(...).filter(...) in django, however sometimes that has performance implications: being able to evaluate it only one in the database (in the WHERE clause) can make big improvements to performance.

Sign up to request clarification or add additional context in comments.

1 Comment

This is a nice idea, Matthew - thanks for posting it. It's been a while since I posted the query but I'm still using the codebase so I'll test it against the solution I ended up using.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.