18

In Numpy 1.4.1, what is the simplest or most efficient way of calculating the histogram of a masked array? numpy.histogram and pyplot.hist do count the masked elements, by default!

The only simple solution I can think of right now involves creating a new array with the non-masked value:

histogram(m_arr[~m_arr.mask])

This is not very efficient, though, as this unnecessarily creates a new array. I'd be happy to read about better ideas!

8
  • 2
    For what it's worth, this would probably be considered a bug in numpy.histogram. You should probably file a bug report and mention it on the mailing list. It's easily fixed by replacing asarray with asanyarray in the numpy.histogram sources. Commented Aug 31, 2010 at 14:55
  • Joe, you might want to submit your comment as an answer: I might well mark it as the accepted answer, if confirmed by the Numpy developers. Commented Sep 2, 2010 at 7:41
  • 3
    I sent out a quick question to the list. mail.scipy.org/pipermail/numpy-discussion/2010-September/… We'll see whether or not folks consider it a bug or not. It seems counter intuitive to me at the very least, though. Commented Sep 2, 2010 at 19:56
  • 2
    For what it's worth, the general consensus was that it was intended behavior, and that such a fix would probably cause more problems than it would fix. E.g.: mail.scipy.org/pipermail/numpy-discussion/2010-September/… Commented Sep 2, 2010 at 23:36
  • Thank you, Joe. Can you summarize your comments in an answer. I'd like to mark it as the accepted answer because it shows that there is nothing better than tillsten's good solution. Commented Sep 3, 2010 at 8:09

4 Answers 4

17

(Undeleting this as per discussion above...)

I'm not sure whether or not the numpy developers would consider this a bug or expected behavior. I asked on the mailing list, so I guess we'll see what they say.

Either way, it's an easy fix. Patching numpy/lib/function_base.py to use numpy.asanyarray rather than numpy.asarray on the inputs to the function will allow it to properly use masked arrays (or any other subclass of an ndarray) without creating a copy.

Edit: It seems like it is expected behavior. As discussed here:

If you want to ignore masked data it's just on extra function call

histogram(m_arr.compressed())

I don't think the fact that this makes an extra copy will be relevant, because I guess full masked array handling inside histogram will be a lot more expensive.

Using asanyarray would also allow matrices in and other subtypes that might not be handled correctly by the histogram calculations.

For anything else besides dropping masked observations, it would be necessary to figure out what the masked array definition of a histogram is, as Bruce pointed out.

Sign up to request clarification or add additional context in comments.

1 Comment

Thank you. One of the arguments against handling masked arrays in histograms is that if histograms handled masked values, one would have to decide how masked data with a masked array of weights should be treated. I don't think that there is any obviously better solution to this problem: it looks like histogram()'s features do not mix too well with masked input+weight arrays.
10

Try hist(m_arr.compressed()).

1 Comment

This is a better idea than my m_arr[~m_arr.mask]. However, it does not solve the problem that a new array is unnecessarily corrected.
6

This is a super old question, but these days I just use:

numpy.histogram(m_arr, bins=.., range=.., density=False, weights=m_arr_mask)

Where m_arr_mask is an array with the same shape as m_arr, consisting of 0 values for elements of m_arr to be excluded from the histogram and 1 values for elements that are to be included.

4 Comments

Also, this won't work if you try to pass in a string for bins. Great answer aside from that.
I can't seem to make it work. When I pass a mask array for weights, the result does not seem consistent with the result I get without mask. I tried passing a mask of random 0 and 1 values, and expected the count in each bin to be divided by approx 2. But it gets divided by more than 20.
I works fine with a more trivial example (much smaller arrays). Looks like a numpy bug, maybe the weights cause numpy.histogram to use a uint8 array as output, which causes overflow.
0

After running into casting issues by trying Erik's solution (see https://github.com/numpy/numpy/issues/16616), I decided to write a numba function to achieve this behavior.

Some of the code was inspired by https://numba.pydata.org/numba-examples/examples/density_estimation/histogram/results.html. I added the mask bit.

import numpy
import numba  

@numba.jit(nopython=True)
def compute_bin(x, bin_edges):
    # assuming uniform bins for now
    n = bin_edges.shape[0] - 1
    a_min = bin_edges[0]
    a_max = bin_edges[-1]

    # special case to mirror NumPy behavior for last bin
    if x == a_max:
        return n - 1  # a_max always in last bin

    bin = int(n * (x - a_min) / (a_max - a_min))

    if bin < 0 or bin >= n:
        return None
    else:
        return bin


@numba.jit(nopython=True)
def masked_histogram(img, bin_edges, mask):
    hist = numpy.zeros(len(bin_edges) - 1, dtype=numpy.intp)

    for i, value in enumerate(img.flat):
        if mask.flat[i]:
            bin = compute_bin(value, bin_edges)
            if bin is not None:
                hist[int(bin)] += 1
    return hist  # , bin_edges

The speedup is significant. On a (1000, 1000) image:

enter image description here

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.