How to create the histogram of an array with masked values, in Numpy?

Question

In Numpy 1.4.1, what is the simplest or most efficient way of calculating the histogram of a masked array? numpy.histogram and pyplot.hist do count the masked elements, by default!

The only simple solution I can think of right now involves creating a new array with the non-masked values:

histogram(m_arr[~m_arr.mask])

This is not very efficient, though, as this unnecessarily creates a new array. I'd be happy to read about better ideas!
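The behavior described above can be reproduced with a small sketch. The array values, the bin count, and the explicit range are illustrative assumptions, not part of the original question. It relies on numpy.histogram converting its input to a plain ndarray, which drops the mask.

```python
import numpy as np

# Illustrative data: the last element is masked and should ideally be ignored.
m_arr = np.ma.masked_array([1.0, 2.0, 3.0, 100.0],
                           mask=[False, False, False, True])

# numpy.histogram works on the underlying data, so the masked 100.0 is still counted.
counts_all, edges = np.histogram(m_arr, bins=4, range=(0.0, 200.0))

# Workaround from the question: index with the inverted mask (creates a new array).
counts_valid, _ = np.histogram(m_arr[~m_arr.mask], bins=4, range=(0.0, 200.0))

print(counts_all.sum())    # elements counted when the mask is ignored
print(counts_valid.sum())  # elements counted after dropping masked values
```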

For what it's worth, this would probably be considered a bug in numpy.histogram. You should probably file a bug report and mention it on the mailing list. It's easily fixed by replacing asarray with asanyarray in the numpy.histogram sources.
– Joe Kington, Commented Aug 31, 2010 at 14:55
Joe, you might want to submit your comment as an answer: I might well mark it as the accepted answer, if confirmed by the Numpy developers.
– Eric O. Lebigot, Commented Sep 2, 2010 at 7:41
I sent out a quick question to the list. mail.scipy.org/pipermail/numpy-discussion/2010-September/… We'll see whether folks consider it a bug or not. It seems counterintuitive to me at the very least, though.
– Joe Kington, Commented Sep 2, 2010 at 19:56
For what it's worth, the general consensus was that it was intended behavior, and that such a fix would probably cause more problems than it would fix. E.g.: mail.scipy.org/pipermail/numpy-discussion/2010-September/…
– Joe Kington, Commented Sep 2, 2010 at 23:36
Thank you, Joe. Can you summarize your comments in an answer? I'd like to mark it as the accepted answer because it shows that there is nothing better than tillsten's good solution.
– Eric O. Lebigot, Commented Sep 3, 2010 at 8:09

Joe Kington · Accepted Answer · 2010-09-07 01:32:45Z

17

(Undeleting this as per discussion above...)

I'm not sure whether or not the numpy developers would consider this a bug or expected behavior. I asked on the mailing list, so I guess we'll see what they say.

Either way, it's an easy fix. Patching numpy/lib/function_base.py to use numpy.asanyarray rather than numpy.asarray on the inputs to the function will allow it to properly use masked arrays (or any other subclass of an ndarray) without creating a copy.
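As a minimal sketch of the difference between the two conversion functions (the sample array is an assumption made for illustration, and this does not patch numpy.histogram itself):

```python
import numpy as np

m_arr = np.ma.masked_array([1, 2, 3], mask=[False, True, False])

plain = np.asarray(m_arr)     # converted to a base ndarray: the mask is dropped
kept = np.asanyarray(m_arr)   # ndarray subclasses pass through unchanged

print(type(plain).__name__)
print(type(kept).__name__)
print(kept is m_arr)
```

Neither call is expected to copy the data here; the difference is only whether the subclass, and with it the mask, survives.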

Edit: It seems like it is expected behavior. As discussed here:

If you want to ignore masked data it's just one extra function call

histogram(m_arr.compressed())

I don't think the fact that this makes an extra copy will be relevant, because I guess full masked array handling inside histogram will be a lot more expensive.

Using asanyarray would also allow matrices in and other subtypes that might not be handled correctly by the histogram calculations.

For anything else besides dropping masked observations, it would be necessary to figure out what the masked array definition of a histogram is, as Bruce pointed out.
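A short sketch of the compressed() approach quoted above, with illustrative values, bin count, and range (all assumptions). It also checks that, when some elements are masked, the compressed result is a separate array and not a view of the original data, which is the extra copy mentioned in the quote.

```python
import numpy as np

m_arr = np.ma.masked_array([1.0, 2.0, 3.0, 100.0],
                           mask=[False, False, False, True])

# compressed() returns a 1-D plain ndarray holding only the unmasked values.
valid = m_arr.compressed()
counts, edges = np.histogram(valid, bins=4, range=(0.0, 200.0))

print(type(valid).__name__, valid.shape)
print(counts)
# Whether the compressed array shares memory with the original data buffer.
print(np.shares_memory(valid, m_arr.data))
```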

edited Sep 7, 2010 at 1:32

answered Sep 2, 2010 at 20:08

Joe Kington

1 Comment

Eric O. Lebigot Over a year ago

Thank you. One of the arguments against handling masked arrays in histograms is that if histograms handled masked values, one would have to decide how masked data with a masked array of weights should be treated. I don't think that there is any obviously better solution to this problem: it looks like histogram()'s features do not mix too well with masked input+weight arrays.

tillsten · Accepted Answer · 2010-09-02 04:56:29Z

10

Try hist(m_arr.compressed()).

answered Sep 2, 2010 at 4:56

tillsten

1 Comment

Eric O. Lebigot Over a year ago

This is a better idea than my m_arr[~m_arr.mask]. However, it does not solve the problem that a new array is unnecessarily created.

Erik Hvatum · Accepted Answer · 2016-02-26 21:12:13Z

6

This is a super old question, but these days I just use:

numpy.histogram(m_arr, bins=.., range=.., density=False, weights=m_arr_mask)

Where m_arr_mask is an array with the same shape as m_arr, consisting of 0 values for elements of m_arr to be excluded from the histogram and 1 values for elements that are to be included.
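One way to build such a weights array from a masked array, as a sketch. The name m_arr_mask follows the line above. Deriving it with numpy.ma.getmaskarray, using a float dtype, and passing an explicit range are assumptions made here. The float dtype is chosen because the comments below suggest that a narrow integer dtype such as uint8 can carry over to the returned counts and overflow.

```python
import numpy as np

m_arr = np.ma.masked_array([1.0, 2.0, 3.0, 100.0],
                           mask=[False, False, False, True])

# 1.0 where the element should be counted, 0.0 where it is masked.
# getmaskarray always returns a full boolean array, even if nothing is masked.
m_arr_mask = (~np.ma.getmaskarray(m_arr)).astype(np.float64)

# An explicit range keeps masked values from stretching the automatic bin edges.
counts, edges = np.histogram(m_arr, bins=4, range=(0.0, 200.0),
                             weights=m_arr_mask)

# Reference result obtained by dropping the masked values first.
reference, _ = np.histogram(m_arr.compressed(), bins=4, range=(0.0, 200.0))

print(counts)     # float counts, because the weights are float
print(reference)
```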

answered Feb 26, 2016 at 21:12

Erik Hvatum

4 Comments

Mad Physicist Over a year ago

Also, this won't work if you try to pass in a string for bins. Great answer aside from that.

PiRK Over a year ago

I can't seem to make it work. When I pass a mask array for weights, the result does not seem consistent with the result I get without mask. I tried passing a mask of random 0 and 1 values, and expected the count in each bin to be divided by approx 2. But it gets divided by more than 20.

PiRK Over a year ago

See filebin.net/jp4x16ekgiyuupgu/Untitled.html?t=2w07ir8v

PiRK Over a year ago

It works fine with a more trivial example (much smaller arrays). Looks like a numpy bug, maybe the weights cause numpy.histogram to use a uint8 array as output, which causes overflow.

PiRK · Accepted Answer · 2020-06-16 13:01:01Z

After running into casting issues by trying Erik's solution (see https://github.com/numpy/numpy/issues/16616), I decided to write a numba function to achieve this behavior.

Some of the code was inspired by https://numba.pydata.org/numba-examples/examples/density_estimation/histogram/results.html. I added the mask bit.

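import numpy
import numba


@numba.jit(nopython=True)
def compute_bin(x, bin_edges):
    # Map a value to its bin index, assuming uniform bins for now.
    n = bin_edges.shape[0] - 1
    a_min = bin_edges[0]
    a_max = bin_edges[-1]

    # Values outside [a_min, a_max] belong to no bin. This is checked first
    # because int() truncates toward zero and would otherwise put values
    # just below a_min into bin 0.
    if x < a_min or x > a_max:
        return None

    # Special case to mirror NumPy behavior: a_max always falls in the last bin.
    if x == a_max:
        return n - 1

    bin_index = int(n * (x - a_min) / (a_max - a_min))

    if bin_index < 0 or bin_index >= n:
        return None
    else:
        return bin_index


@numba.jit(nopython=True)
def masked_histogram(img, bin_edges, mask):
    # mask is truthy where an element should be counted
    # (the opposite of the numpy.ma convention).
    hist = numpy.zeros(len(bin_edges) - 1, dtype=numpy.intp)

    for i, value in enumerate(img.flat):
        if mask.flat[i]:
            bin_index = compute_bin(value, bin_edges)
            if bin_index is not None:
                hist[int(bin_index)] += 1
    return hist  # , bin_edges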

The speedup is significant. On a (1000, 1000) image:
