
I have the following model:

class Item(models.Model):

    unique_code = models.CharField(max_length=100)
    category_code = models.CharField(max_length=100)
    label = models.CharField(max_length=100)

I would like to get:

  • the count of the different category_codes used

  • the count of the different unique_codes used

  • the count of the different combinations of category_code and unique_code used


Any ideas?

3 Answers


Django ORM solution, as requested:

the count of the different category_codes used:

category_codes_cnt = Item.objects.values('category_code').distinct().count()

count of the different unique_codes used:

unique_codes_cnt = Item.objects.values('unique_code').distinct().count()

count of the different combination of category_code and unique_code used:

codes_cnt = Item.objects.values('category_code', 'unique_code').distinct().count()
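
If you'd rather get the first two counts in a single round trip, Django's Count aggregate accepts distinct=True; this is standard ORM, though the result keys below are just illustrative names:

from django.db.models import Count

counts = Item.objects.aggregate(
    category_codes_cnt=Count('category_code', distinct=True),
    unique_codes_cnt=Count('unique_code', distinct=True),
)
# -> {'category_codes_cnt': ..., 'unique_codes_cnt': ...}

The combination count still needs the values().distinct().count() form above, since Count() aggregates a single expression.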

Don't waste too much time trying to finesse a cool SQL solution.

from collections import defaultdict

# One pass over the table, tallying occurrences of each code and code pair.
count_cat_code = defaultdict(int)
count_unique_code = defaultdict(int)
count_combo_code = defaultdict(int)
for obj in Item.objects.all():
    count_cat_code[obj.category_code] += 1
    count_unique_code[obj.unique_code] += 1
    count_combo_code[obj.category_code, obj.unique_code] += 1

# The three counts you asked for are then just the dict sizes:
# len(count_cat_code), len(count_unique_code), len(count_combo_code).

That will do it. And it will work reasonably quickly. Indeed, if you do some benchmarking, you may find that -- sometimes -- it's as fast as a "pure SQL" statement.

[Why? Because an RDBMS must use a fairly inefficient algorithm for GROUP BY and counts. In Python we have the luxury of making assumptions based on our application and our knowledge of the data. In this case, for example, I assumed it would all fit in memory, an assumption the RDBMS's internal algorithms cannot make.]
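
If memory is a worry with a big table, here is a minimal variation of the same loop using the standard values_list() and iterator() queryset methods, so only the two columns are fetched and rows are streamed rather than cached; collections.Counter is interchangeable with the defaultdicts above:

from collections import Counter

count_cat_code = Counter()
count_unique_code = Counter()
count_combo_code = Counter()
# Fetch only the two columns, and stream rows instead of
# caching the whole queryset in memory.
rows = Item.objects.values_list('category_code', 'unique_code').iterator()
for category_code, unique_code in rows:
    count_cat_code[category_code] += 1
    count_unique_code[unique_code] += 1
    count_combo_code[category_code, unique_code] += 1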

Comments

@S.Lott: Thanks!! :) But what if it contains 1 million rows, which is roughly the case here. Any tips?
Doesn't matter. SQL isn't much faster than Python for this. All million rows must be read no matter what. SQL often has to sort them into temporary storage before it can do the GROUP BY, leading to O(n log n) performance. In Python you make one fast pass through the data, collecting everything you need. Your hash tables are all O(1), so the whole thing is O(n).
OTOH, that does mean every row has to be marshalled/transmitted/unmarshalled, which may turn out to be more overhead than hashing. Don't guess, measure. At least for a small example I used, PostgreSQL will do a hash aggregate anyway.
@araqnid: "does mean every row has to be marshalled/transmitted/unmarshalled". Yes. "which may turn out to be more overhead than hashing"? What? The database can't do the hashing for you. And GROUP BY in SQL may not involve any hashing. GROUP BY in SQL often involves a huge, expensive sort. A table scan can be faster.
This will be significantly slower than a SQL query that calculates the totals within the database and transmits only the results to the client, especially if the objects are large and plentiful.
select count(distinct unique_code) as unique_code_count,
       count(distinct category_code) as category_code_count,
       count(*) as combination_count
from (select unique_code, category_code
      from item
      group by unique_code, category_code) combination
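
To run that from Django rather than a database shell, here is a minimal sketch using the standard connection.cursor() API; note the table name item matches the SQL above, while a default Django project would name it <app_label>_item:

from django.db import connection

query = """
select count(distinct unique_code) as unique_code_count,
       count(distinct category_code) as category_code_count,
       count(*) as combination_count
from (select unique_code, category_code
      from item
      group by unique_code, category_code) combination
"""
with connection.cursor() as cursor:
    cursor.execute(query)
    unique_code_count, category_code_count, combination_count = cursor.fetchone()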

