
I have a View that returns some statistics about email list growth. The models involved are:

models.py

class Contact(models.Model):
    email_list = models.ForeignKey(EmailList, related_name='contacts')
    customer = models.ForeignKey('Customer', related_name='contacts')
    status = models.CharField(max_length=8)
    create_date = models.DateTimeField(auto_now_add=True)


class EmailList(models.Model):
    customers = models.ManyToManyField('Customer',
        related_name='lists',
        through='Contact')


class Customer(models.Model):
    is_unsubscribed = models.BooleanField(default=False, db_index=True)
    unsubscribe_date = models.DateTimeField(null=True, blank=True, db_index=True)

In the View, I iterate over all EmailList objects and compute some metrics in the following way:

view.py

class ListHealthView(View):
    def get(self, request, *args, **kwargs):
        start_date, end_date = get_dates_from_querystring(request)

        data = []
        for email_list in EmailList.objects.all():
            # historic data up to start_date
            past_contacts = email_list.contacts.filter(
                status='active',
                create_date__lt=start_date).count()
            past_unsubscribes = email_list.customers.filter(
                is_unsubscribed=True,
                unsubscribe_date__lt=start_date,
                contacts__status='active').count()
            past_deleted = email_list.contacts.filter(
                status='deleted',
                modify_date__lt=start_date).count()
            # data for the given timeframe
            new_contacts = email_list.contacts.filter(
                status='active',
                create_date__range=(start_date, end_date)).count()
            new_unsubscribes = email_list.customers.filter(
                is_unsubscribed=True,
                unsubscribe_date__range=(start_date, end_date),
                contacts__status='active').count()
            new_deleted = email_list.contacts.filter(
                status='deleted',
                modify_date__range=(start_date, end_date)).count()

            data.append({
                'new_contacts': new_contacts,
                'new_unsubscribes': new_unsubscribes,
                'new_deleted': new_deleted,
                'past_contacts': past_contacts,
                'past_unsubscribes': past_unsubscribes,
                'past_deleted': past_deleted,
            })
        return Response({'data': data})

This works fine, but as my DB has grown, the response time from this view has risen above 1s, and it occasionally causes long-running queries in the database. I think the most obvious improvement would be to index EmailList.customers, but maybe it needs to be a compound index? Also, is there a better way of doing this, perhaps using aggregates?

EDIT

After @bdoubleu answer I tried the following:

data = (
    EmailList.objects.annotate(
        past_contacts=Count(Subquery(
            Contact.objects.values('id').filter(
                email_list=F('pk'),
                status='active',
                create_date__lt=start_date)
        )),
        past_deleted=Count(Subquery(
            Contact.objects.values('id').filter(
                email_list=F('pk'),
                status='deleted',
                modify_date__lt=start_date)
        )),
    )
    .values(
        'past_contacts', 'past_deleted',
    )
)

I had to switch from OuterRef to F because OuterRef raised ProgrammingError: more than one row returned by a subquery used as an expression. I suspect the cause is that my EmailList model defines id = HashidAutoField(primary_key=True, salt='...'), but I'm not completely sure about it.

Now the query works, but sadly all counts are returned as 0.

2 Comments
  • I'm not a Django expert, but it seems that you're running multiple queries in the loop body (email_list.contacts.filter, email_list.customers.filter), though I'm not sure. Commented Sep 11, 2019 at 16:56
  • FYI, Subquery won't work with F; it has to be OuterRef. Commented Sep 11, 2019 at 19:47

1 Answer


As written, your code issues 6 queries for every EmailList instance. For 100 instances that is at least 600 queries, which slows things down.
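
To make the round-trip count concrete, here is a minimal sketch that mimics the shape of the view's loop with the standard-library sqlite3 module instead of the Django ORM. The table layout, the `run` helper and the `stats` counter are illustrative assumptions, not part of the original code; the point is only how the number of statements grows with the number of lists.

```python
import sqlite3

# In-memory stand-in for the real database (assumed, simplified schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE email_list (id INTEGER PRIMARY KEY);
    CREATE TABLE contact (
        id INTEGER PRIMARY KEY,
        email_list_id INTEGER REFERENCES email_list(id),
        status TEXT,
        create_date TEXT
    );
""")
conn.executemany(
    "INSERT INTO email_list (id) VALUES (?)",
    [(i,) for i in range(1, 101)],  # 100 lists, as in the example above
)

stats = {"queries": 0}  # hypothetical counter for this sketch


def run(sql, params=()):
    """Execute one statement and record it as one database round trip."""
    stats["queries"] += 1
    return conn.execute(sql, params).fetchall()


METRICS_PER_LIST = 6  # the view computes six counts for every list

for (list_id,) in run("SELECT id FROM email_list"):
    for _ in range(METRICS_PER_LIST):
        # Each .count() in the view corresponds to one SELECT COUNT(*) like this.
        run("SELECT COUNT(*) FROM contact WHERE email_list_id = ?", (list_id,))

print(stats["queries"])  # one query for the lists plus six per list
```

With 100 lists this issues 601 statements: one to fetch the lists and six per list.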

You can optimize this by using Subquery() expressions and .values().

from django.db.models import Count, OuterRef, Subquery

data = (
    EmailList.objects
    .annotate(
        past_contacts=Count(Subquery(
            Contact.objects.filter(
                email_list=OuterRef('pk'),
                status='active',
                create_date__lt=start_date
            ).values('id')
        )),
        past_unsubscribes=...,
        past_deleted=...,
        new_contacts=...,
        new_unsubscribes=...,
        new_deleted=...,
    )
    .values(
        'past_contacts', 'past_unsubscribes',
        'past_deleted', 'new_contacts',
        'new_unsubscribes', 'new_deleted',
    )
)
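
As an alternative to subqueries, the contact-based counts can also be computed in a single grouped query with conditional aggregation. This is a different technique from the one above, sketched here in plain SQL through the standard-library sqlite3 module so it runs on its own. The table layout and sample rows are assumptions, and `modify_date` is assumed to exist on Contact because the view filters on it even though the model shown does not declare it. The unsubscribe counts would additionally need a join to the customer table and are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contact (
        id INTEGER PRIMARY KEY,
        email_list_id INTEGER,
        status TEXT,
        create_date TEXT,
        modify_date TEXT   -- assumed column; the view filters on it
    );
    INSERT INTO contact (email_list_id, status, create_date, modify_date) VALUES
        (1, 'active',  '2019-01-10', '2019-01-10'),
        (1, 'active',  '2019-08-15', '2019-08-15'),
        (1, 'deleted', '2019-02-01', '2019-08-20'),
        (2, 'active',  '2019-03-05', '2019-03-05'),
        (2, 'deleted', '2019-01-01', '2019-04-01');
""")

params = {"start_date": "2019-08-01", "end_date": "2019-08-31"}

# One pass over contact: each SUM(CASE ...) counts only the rows that
# match its condition, and GROUP BY yields one result row per list.
sql = """
    SELECT
        email_list_id,
        SUM(CASE WHEN status = 'active'
                  AND create_date < :start_date THEN 1 ELSE 0 END) AS past_contacts,
        SUM(CASE WHEN status = 'deleted'
                  AND modify_date < :start_date THEN 1 ELSE 0 END) AS past_deleted,
        SUM(CASE WHEN status = 'active'
                  AND create_date BETWEEN :start_date AND :end_date
                 THEN 1 ELSE 0 END) AS new_contacts,
        SUM(CASE WHEN status = 'deleted'
                  AND modify_date BETWEEN :start_date AND :end_date
                 THEN 1 ELSE 0 END) AS new_deleted
    FROM contact
    GROUP BY email_list_id
    ORDER BY email_list_id
"""
rows = conn.execute(sql, params).fetchall()
for row in rows:
    print(row)  # (email_list_id, past_contacts, past_deleted, new_contacts, new_deleted)
```

Note that a list with no contacts at all does not appear in this result, unlike an annotation on EmailList. In Django 2.0 and later, aggregates accept a `filter` argument (for example `Count('contacts', filter=Q(...))`), which should produce comparable SQL, although combining several joined relations in one annotate call can inflate the counts.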

Update: for older versions of Django, your subquery may need to look like the one below (shown here with different models, Customer and CustomerTemplate):

customers = (
    Customer.objects
    .annotate(
        template_count=Subquery(
            CustomerTemplate.objects
            .filter(customer=OuterRef('pk'))
            .values('customer')
            .annotate(count=Count('*')).values('count')
        )
    ).values('name', 'template_count')
)
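
The pattern above works because the inner query is reduced to exactly one row and one column per outer row, which is what a scalar subquery requires and what the two ProgrammingError messages in the comments complain about. A minimal sketch of the equivalent SQL shape for the question's models, using the standard-library sqlite3 module with an assumed, simplified schema and made-up sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE email_list (id INTEGER PRIMARY KEY);
    CREATE TABLE contact (
        id INTEGER PRIMARY KEY,
        email_list_id INTEGER REFERENCES email_list(id),
        status TEXT,
        create_date TEXT
    );
    INSERT INTO email_list (id) VALUES (1), (2), (3);
    INSERT INTO contact (email_list_id, status, create_date) VALUES
        (1, 'active',  '2019-01-10'),
        (1, 'active',  '2019-02-11'),
        (1, 'active',  '2019-09-01'),
        (2, 'deleted', '2019-01-05');
""")

# The inner SELECT is correlated with the outer row through email_list.id
# and returns a single value (the count), so it is valid as a scalar
# subquery in the SELECT list.
rows = conn.execute(
    """
    SELECT
        email_list.id,
        (
            SELECT COUNT(*)
            FROM contact
            WHERE contact.email_list_id = email_list.id
              AND contact.status = 'active'
              AND contact.create_date < :start_date
        ) AS past_contacts
    FROM email_list
    ORDER BY email_list.id
    """,
    {"start_date": "2019-08-01"},
).fetchall()

for row in rows:
    print(row)  # (email_list id, number of active contacts created before start_date)
```

Here every list gets a row, including lists with no matching contacts. In the Django version, the grouped subquery may yield None rather than 0 for such lists, so wrapping it in Coalesce is worth considering.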

6 Comments

This is awesome, but I get: ProgrammingError: subquery must return only one column. And if I add something like Contact.objects.values('id').filter(...), I get: ProgrammingError: more than one row returned by a subquery used as an expression
"Very, very definitely" the right answer [concept ...] here. If possible, use an SQL monitor to let you see what the Django-generated queries actually are. Then, use subqueries or whatever else you have to so that the SQL server is doing the work using as few queries as possible. What's killing you right now is that you're turning around the communications link 600 times.
@PepperoniPizza my mistake - I forgot to add .values('id') to the end of the subquery (it has to be after the filter). Should be good to go now.
@bdoubleu I would expect values('id') to return 1 row, but I get: ProgrammingError: more than one row returned by a subquery used as an expression?
@PepperoniPizza is it possible you missed the Count()? If not, could you please paste your code?