1

I read, that numpy uses introselect to find the median in an array/ list (https://www.researchgate.net/publication/303755458_Fast_Deterministic_Selection) [page 2; last 5 lines]. But I couldn't find any hints for that in the numpy source code: https://github.com/numpy/numpy/blob/v1.19.0/numpy/lib/function_base.py#L3438-L3525

Does anyone know where I could find the numpy implementation of introselect? Or if numpy doesn't use introselect, what kind of algorithm do the use to find the median?

Many thanks in advance :)

1 Answer 1

2

In line 3528 seems to be the main median function. If we cut out all the multidimensional and nan stuff we get something like

def _median(a, axis=None, out=None, overwrite_input=False):
    # can't be reasonably be implemented in terms of percentile as we have to
    # call mean to not break astropy

    # Set the partition indexes
    sz = a.shape
    if sz % 2 == 0:
        szh = sz // 2
        kth = [szh - 1, szh]
    else:
        kth = [(sz - 1) // 2]

    part = partition(a, kth, axis=None)

    return mean(part[indexer], axis=None, out=out)

So partition is doing all the work and comes from

from numpy.core.fromnumeric import (
    ravel, nonzero, partition, mean, any, sum
    )

If we go to the numpy code we get to the following C code.

NPY_SELECTKIND sortkind = NPY_INTROSELECT;

and

val = PyArray_Partition(self, ktharray, axis, sortkind);

Which implemented here and uses

mid = ll + median_of_median5_@suff@(v + ll, hh - ll, NULL, NULL);

So it is introselect.

Once twice the recursion depth is reached the algorithm change to use meadian-of-median5 until the partition is less than 5.

Sign up to request clarification or add additional context in comments.

4 Comments

Hey thanks for your answer. I'm not sure if thats to specific, but do you know what kind of criterion they use to decide if they should switch to median of medians? They say in Line 341 in the C code: depth_limit = npy_get_msb(num) * 2; and later: ll = low + 1 and hh = high (line: 345) but what kind of start values do ll and hh have? And what value does the depth_limit have or how is it calculated. (I'm sorry for that kinda "dumb" question, but I'm not that familiar with C )
@maxpower its low and high in the array, shrinking the part that still needs to be partitioned.
and how is the depth_limit calculated? or more specifically what does npy_get_msb(num) mean and what's the value of num? Thank you very much, you're helping me alot :)
msb - most significant bit, how many times you can half the number.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.