
The relationship between Foo and Bar is through Baz as follows:

class Foo(Model):
    # stuff

class Bar(Model):
    # stuff

class Baz(Model):
    foos = ManyToManyField("Foo")
    bar = ForeignKey("Bar")

I basically need to generate the following dict representing the Bars that are related to each Foo through Baz (in dict comprehension pseudo-code):

{ foo.id: [list of unique bars related to the foo through any baz] for foo in all foos}

I can currently generate my data structure with O(N) queries (one query per Foo), but with lots of data this is a bottleneck. I need it optimized to O(1) (not a single query per se, but a fixed number of queries irrespective of the data size of any of the models), while also minimizing iteration over the data in Python.
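For concreteness, the grouping step itself (independent of how the rows are fetched) can be sketched in plain Python. The `rows` list of (foo_id, bar_id) pairs here is hypothetical stand-in data, not actual Django output:

```python
from collections import defaultdict

# Hypothetical flat rows of (foo_id, bar_id) pairs, as a single
# joined query might return them (possibly with duplicates).
rows = [(1, 10), (1, 11), (1, 10), (2, 11)]

grouped = defaultdict(set)
for foo_id, bar_id in rows:
    grouped[foo_id].add(bar_id)  # set membership enforces uniqueness

result = dict(grouped)
# result == {1: {10, 11}, 2: {11}}
```

This is a single pass over the rows, so the Python-side work is linear in the number of (foo, bar) pairs regardless of how many queries produced them.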

  • How would you write it in SQL? Commented Jan 8, 2015 at 23:31
  • If I knew that, I could translate it to Django. I am proficient with basic SQL but am, alas, not an OUTER JOIN ninja. :) Commented Jan 8, 2015 at 23:34

2 Answers


If you can drop to SQL, you could use a single query (the app name should prefix all the table names):

select distinct foo.id, bar.id
from baz_foos
join baz on baz_foos.baz_id = baz.id
join foo on baz_foos.foo_id = foo.id
join bar on baz.bar_id = bar.id

baz_foos is the many-to-many table Django creates.

@Alasdair's solution is probably more readable (although if you're doing this for performance reasons, readability might not be the top priority). His solution also uses exactly two queries, which is hardly a difference. The only problem I see is if you have a large number of Baz objects, since the generated SQL looks like this:

SELECT "foobar_baz"."id", "foobar_baz"."bar_id", "foobar_bar"."id" 
FROM "foobar_baz" 
INNER JOIN "foobar_bar" ON ("foobar_baz"."bar_id" = "foobar_bar"."id")

SELECT
    ("foobar_baz_foos"."baz_id") AS "_prefetch_related_val", 
    "foobar_foo"."id" 
FROM "foobar_foo" 
INNER JOIN "foobar_baz_foos" ON ("foobar_foo"."id" = "foobar_baz_foos"."foo_id") 
WHERE "foobar_baz_foos"."baz_id" IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
    15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 
    35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 
    55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 
    75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 
    95, 96, 97, 98, 99, 100, 101)

If you have only a few Bars and a few hundred Foos, I would do:

from collections import defaultdict

from django.db import connection

# The raw query from above.
sql = """
select distinct foo.id, bar.id
from baz_foos
join baz on baz_foos.baz_id = baz.id
join foo on baz_foos.foo_id = foo.id
join bar on baz.bar_id = bar.id
"""

# foos = {f.id: f for f in Foo.objects.all()}
bars = {b.id: b for b in Bar.objects.all()}

c = connection.cursor()
c.execute(sql)
d = defaultdict(set)
for f_id, b_id in c.fetchall():
    d[f_id].add(bars[b_id])
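If the question's `{foo.id: [list of unique bars]}` shape is needed rather than sets, the defaultdict built above can be converted in one pass. A minimal sketch, with stand-in ids in place of the Bar instances the real dict would hold:

```python
from collections import defaultdict

# Stand-in for the defaultdict of sets built above; real values
# would be Bar model instances rather than plain ids.
d = defaultdict(set)
d[1].update({10, 11})
d[2].add(11)

# One pass to turn each set into a list (set iteration order is arbitrary).
result = {foo_id: list(bars) for foo_id, bars in d.items()}
```

Uniqueness is already guaranteed by the sets, so the conversion is just a reshaping step.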

3 Comments

Yeah, unfortunately of the data sets, baz is typically the largest and least limited (there are typically no more than a few Bars and maybe 30-100 Foos), so I may go ahead and write the SQL if I can figure out how to make it play nice with Python.
I've updated the answer (the solution is basically doing a manual 'prefetch_related' on Bar).
I do like the straightforwardness and cleanness of Alasdair's method, but this is great for situations where optimization is key and Baz has a big data set. Learned some good things here. Thanks, thebjorn.

Using select_related and prefetch_related, I think you can build the required data structure with 2 queries:

out = {}
bazes = Baz.objects.select_related('bar').prefetch_related('foos')
for baz in bazes:
    for foo in baz.foos.all():
        out.setdefault(foo.id, set()).add(baz.bar)

The values of the output dictionary are sets, not lists as in your question, to ensure uniqueness.
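As the comments below note, `setdefault` can be swapped for `collections.defaultdict(set)` with the same result. A sketch of that variant using stand-in ids instead of model instances, with a final `dict()` conversion (which sidesteps the Django-template gotcha mentioned in the comments):

```python
from collections import defaultdict

# Hypothetical (foo_id, bar_id) pairs; real code would iterate the
# select_related/prefetch_related queryset shown above.
pairs = [(1, 10), (2, 10), (1, 11)]

out = defaultdict(set)
for foo_id, bar_id in pairs:
    out[foo_id].add(bar_id)

# Convert back to a plain dict before handing it to a Django template,
# since defaultdict's auto-creation of missing keys behaves surprisingly there.
out = dict(out)
```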

5 Comments

Why not use a collections.defaultdict(set)?
There's a gotcha when using a defaultdict in a Django template. No other reason. Using a defaultdict would be fine, and you could convert to a regular dict afterwards if required.
Interesting bit from the 1.7 docs regarding large datasets: "prefetch_related in most cases will be implemented using an SQL query that uses the 'IN' operator. This means that for a large QuerySet a large 'IN' clause could be generated, which, depending on the database, might have performance problems of its own when it comes to parsing or executing the SQL query. Always profile for your use case!"
Yeah I've noticed that prefetch_related basically uses an IN clause with a list of ID's. I'll profile it as I implement this and see what we're looking at.
Because it's so straightforward, I think this is the correct answer for everything except special situations (where optimization is paramount and Baz has a huge data set) warranting the complexity of @thebjorn's response.
