15

The development version of Django has aggregate functions like Avg, Count, Max, Min, StdDev, Sum, and Variance. Is there a reason Median is missing from the list?

Implementing one seems like it would be easy. Am I missing something? How much are the aggregate functions doing behind the scenes?

9 Answers

28

Here's your missing function. Pass it a queryset and the name of the column that you want to find the median for:

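def median_value(queryset, term):
    count = queryset.count()
    # Floor division picks the middle index on both Python 2 and Python 3;
    # int(round(count/2)) is off by one on Python 3 for some odd counts.
    return queryset.values_list(term, flat=True).order_by(term)[count // 2]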

That wasn't as hard as some of the other responses seem to indicate. The important thing is to let the db sorting do all of the work, so if you have the column already indexed, this is a super cheap operation.
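To illustrate what "letting the db do the sorting" amounts to, here is a minimal sketch that uses Python's built-in sqlite3 module instead of Django. The `item` table, the `price` column and the `middle_value` helper are made up for the example. As far as I know, slicing an ordered queryset at a single index turns into the same kind of `ORDER BY ... LIMIT 1 OFFSET n` query, so treat this as an approximation of what happens, not as Django's actual SQL.

```python
import sqlite3

# In-memory database with a hypothetical table and an index on the sorted column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (price INTEGER)")
conn.execute("CREATE INDEX item_price_idx ON item (price)")
conn.executemany(
    "INSERT INTO item (price) VALUES (?)",
    [(p,) for p in (30, 10, 50, 20, 40)],
)


def middle_value(conn):
    """Return the middle row of the sorted column, or None for an empty table."""
    (count,) = conn.execute("SELECT COUNT(*) FROM item").fetchone()
    if count == 0:
        return None
    # The database sorts and skips rows; only one value is sent back to Python.
    row = conn.execute(
        "SELECT price FROM item ORDER BY price LIMIT 1 OFFSET ?",
        (count // 2,),
    ).fetchone()
    return row[0]


print(middle_value(conn))
```

Only the count and a single row cross the wire, which is why this stays cheap when the column is indexed.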

(update 1/28/2016) If you want to be more strict about the definition of median for an even number of items, this will average together the value of the two middle values.

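from decimal import Decimal

def median_value(queryset, term):
    count = queryset.count()
    values = queryset.values_list(term, flat=True).order_by(term)
    if count % 2 == 1:
        # Odd count: floor division gives the single middle index.
        return values[count // 2]
    else:
        # Even count: average the two middle values. Slice bounds must be ints.
        return sum(values[count // 2 - 1:count // 2 + 1]) / Decimal(2)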

4 Comments

There is a small inaccuracy in this implementation when the number of elements is even. Quote from en.wikipedia.org/wiki/Median: "If there is an even number of observations, then there is no single middle value; the median is then usually defined to be the mean of the two middle values". I think that once the values_list is retrieved, it's best to use a Python 'median' function (for such a function, see this thread: stackoverflow.com/questions/24101524/…)
@o_c That's a valid point, but I don't think that it's a good idea to use Python's median function on the whole data set -- that's an expensive operation where all I really need to do is make a small change if count is even. I'll see if I can throw something together.
I know this thread is a little bit old, but is your updated answer really accurate? E.g. if there are three values in the queryset, return values[int(round(3/2))] will resolve to values[2]. IMHO the correct solution would be return values[int(count/2)], which would resolve to values[1].
@RaideR This answer was originally written for Python 2, which has different rounding behavior. See stackoverflow.com/questions/21839140/… . So it makes sense that there are now some off-by-one errors with Python 3, depending on whether the number is odd or even. Will have a closer look.
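For anyone wanting to see the off-by-one from the last two comments, here is a small sketch for Python 3. A plain list stands in for the ordered queryset, and the two helper names are made up for the comparison.

```python
def middle_index_rounded(count):
    # The expression discussed above. On Python 3, "/" is true division and
    # round() rounds halves to the nearest even integer.
    return int(round(count / 2))


def middle_index_floor(count):
    # Floor division always lands on the middle index of an odd-length sequence.
    return count // 2


ordered = [10, 20, 30]  # stand-in for an ordered values_list
print(ordered[middle_index_rounded(len(ordered))])  # last element, not the median
print(ordered[middle_index_floor(len(ordered))])    # middle element

for count in (1, 3, 5, 7):
    print(count, middle_index_rounded(count), middle_index_floor(count))
```

The two helpers agree for counts 1 and 5 but differ for 3 and 7, so the rounding version fails only for some odd counts, which makes it easy to miss in testing.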
15

Because median isn't a SQL aggregate. See, for example, the list of PostgreSQL aggregate functions and the list of MySQL aggregate functions.

7

Well, the reason is probably that you need to track all the numbers to calculate the median. Avg, Count, Max, Min, StdDev, Sum, and Variance can all be calculated with constant storage needs. That is, once you "record" a number you'll never need it again.

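FWIW, the variables you need to track are: min, max, count, <n> (the running average of the values), and <n^2> (the running average of the squared values). The variance then follows as <n^2> - <n>^2, and the standard deviation is its square root.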
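A minimal sketch of that constant-storage idea in plain Python, outside of any database. The `RunningStats` class is made up for illustration, and it tracks sums rather than running averages, which is equivalent once you divide by the count.

```python
import math


class RunningStats:
    """Accumulates count, min, max, sum and sum of squares in constant space."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0
        self.min = math.inf
        self.max = -math.inf

    def add(self, x):
        # Each value updates the accumulators and is then never needed again.
        self.count += 1
        self.total += x
        self.total_sq += x * x
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    @property
    def mean(self):
        return self.total / self.count

    @property
    def variance(self):
        # Population variance: <n^2> - <n>^2
        return self.total_sq / self.count - self.mean ** 2

    @property
    def stddev(self):
        return math.sqrt(self.variance)


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(value)
print(stats.count, stats.min, stats.max, stats.mean, stats.variance, stats.stddev)
```

No such fixed set of accumulators gives an exact median, which is the point of this answer. Note that the sum-of-squares formula can lose precision on large or tightly clustered data; Welford's algorithm is the usual numerically stable alternative.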

2

A strong possibility is that median is not part of standard SQL.

Also, it requires a sort, making it quite expensive to compute.

3 Comments

There are linear, non sorting, algorithms: valis.cs.uiuc.edu/~sariel/research/CG/applets/linear_prog/…
Wrong algorithm, I meant median of medians: en.wikipedia.org/wiki/…
@Todd Gardner: The first link is the "partition-based general selection" and it's O(nlogn) not linear. The site is wrong. It would be nice to delete that comment, but leave the median-of-medians comment.
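For reference, partition-based selection can be sketched in a few lines of Python. This is randomized quickselect, not the median-of-medians algorithm linked above. To my knowledge it runs in expected linear time with a quadratic worst case, while median-of-medians chooses the pivot deterministically to guarantee linear time. The function name is made up, and it works on an in-memory list, so it says nothing about what a database would do.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element (0-based) without fully sorting."""
    items = list(values)
    if not 0 <= k < len(items):
        raise IndexError("k out of range")
    while True:
        pivot = random.choice(items)
        lower = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        if k < len(lower):
            # The answer lies among the elements smaller than the pivot.
            items = lower
        elif k < len(lower) + len(equal):
            # The k-th element is the pivot itself.
            return pivot
        else:
            # Discard the pivot and everything below it, then adjust k.
            k -= len(lower) + len(equal)
            items = [x for x in items if x > pivot]


data = [7, 1, 5, 3, 9]
print(quickselect(data, len(data) // 2))
```

Each pass discards at least the pivot, so the loop always terminates.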
2

I have no idea what db backend you are using, but if your db supports another aggregate, or you can find a clever way of doing it, you can probably access it easily through Aggregate.
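As a rough illustration of a backend gaining an extra aggregate, here is a sketch using SQLite through Python's built-in sqlite3 module, which lets you register user-defined aggregates. The `py_median` function name, the `MedianAggregate` class and the `item` table are all made up for the example, and SQLite is only a stand-in for whatever backend you actually use.

```python
import sqlite3
import statistics


class MedianAggregate:
    """User-defined SQLite aggregate that collects values and returns their median."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per row; NULLs are skipped like in built-in aggregates.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # Called once at the end of the group.
        return statistics.median(self.values) if self.values else None


conn = sqlite3.connect(":memory:")
conn.create_aggregate("py_median", 1, MedianAggregate)
conn.execute("CREATE TABLE item (price REAL)")
conn.executemany(
    "INSERT INTO item (price) VALUES (?)",
    [(1.0,), (2.0,), (3.0,), (4.0,)],
)
(result,) = conn.execute("SELECT py_median(price) FROM item").fetchone()
print(result)
```

Once the database knows the function, exposing it in Django should mostly be a matter of an Aggregate subclass whose `function` attribute names it, though I have not tested that part here. Unlike the built-in aggregates, this one has to hold every value of the group in memory.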

1

FWIW, you can extend PostgreSQL 8.4 and above to have a median aggregate function with these code snippets.

Other code snippets (which work for older versions of PostgreSQL) are shown here. Be sure to read the comments for this resource.

-1

I have implemented the following version now as I also got some weird results with the above implementation and rounding behaviour.

Using Python 3.11.4

from decimal import Decimal

def median_value(queryset, column_name):
    count = queryset.count()
    if count == 0:
        # Empty queryset: return zero instead of failing on the slice below.
        return Decimal("0")
    values = queryset.values_list(column_name, flat=True).order_by(column_name)
    mid = count // 2
    if count % 2 == 1:
        return values[mid]
    else:
        # Even count: average the two middle values.
        return sum(values[mid - 1:mid + 1]) / Decimal(2)

1 Comment

As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.
-1

This is based on the solution from @Mark Chackerian but addresses the following issues:

  • If the base queryset is empty, the result is None
  • The data might change between the queries for count() and values(), so this locks the queryset inside a transaction.

The result type will generally be the same as the database field type (Decimal, int, or float). The only exception is an int column with an even number of values, where averaging the two middle values produces a float.

For example, the median of [1,2,3] is 2 (int) while [1,2] yields 1.5 (float). If you need the result to be an int in this case, apply round or floor to it.

This requires Python 3.10+, although after removing the type annotations it probably works with earlier versions, too.

from decimal import Decimal

from django.db import transaction
from django.db.models import QuerySet


def median_value(queryset: QuerySet, field: str) -> Decimal | float | int | None:
    with transaction.atomic():
        count = queryset.select_for_update().count()
        if count == 0:
            return None
        values = queryset.values_list(field, flat=True).order_by(field)
        if count % 2 == 1:
            return values[count // 2]
        else:
            return sum(values[count // 2 - 1 : count // 2 + 1]) / 2

-1

If you don't mind an external dependency, you can use tailslide:

from tailslide import Median

Item.objects.aggregate(Median('price'))
