NumPy - Faster Operations on Masked Array?

Question

I have a numpy array:

import numpy as np
arr = np.random.rand(100)

To find its maximum value I call np.amax, which runs 155,357 times a second on my machine.

However, for some reason, I have to mask some of its values. Let's, for example, mask just one cell:

import numpy.ma as ma
arr = ma.masked_array(arr, mask=[0]*99 + [1])

Now, finding the max is much slower, running 26,574 times a second.

This is only 17% of the speed of this operation on a non-masked array.

Other operations, such as subtract, add, and multiply, are affected too. Although on a masked array they operate on ALL OF THE VALUES, they run at only about 3% of the speed of the same operation on a non-masked array (15,343/497,663).

I'm looking for a faster way to operate on masked arrays like this, whether it's using numpy or not.
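A rough way to reproduce this kind of measurement is sketched below. The helper name calls_per_second and the repeat count are assumptions made for illustration, and the absolute numbers will differ from machine to machine, so no expected output is shown.

```python
import timeit

import numpy as np
import numpy.ma as ma

rng = np.random.default_rng(0)
plain = rng.random(100)
masked = ma.masked_array(plain, mask=[0] * 99 + [1])


def calls_per_second(func, number=2000):
    """Hypothetical helper: rough calls-per-second for a zero-argument callable."""
    elapsed = timeit.timeit(func, number=number)
    return number / elapsed


rates = {
    "plain max": calls_per_second(lambda: np.amax(plain)),
    "masked max": calls_per_second(lambda: np.amax(masked)),
    "plain add": calls_per_second(lambda: plain + plain),
    "masked add": calls_per_second(lambda: masked + masked),
}

for name, rate in rates.items():
    print(f"{name:>10}: {rate:,.0f} calls/s")
```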

(I need to run this on real data, which is arrays with multiple dimensions, and millions of cells)

hpaulj · Accepted Answer · 2020-03-14 17:36:54Z

MaskedArray is a subclass of the base numpy ndarray. It does not have compiled code of its own. Look at the numpy/ma/ directory for details, or the main file:

/usr/local/lib/python3.6/dist-packages/numpy/ma/core.py

A masked array has two key attributes, data and mask: one is the data array you used to create it, the other a boolean array of the same size.

So all operations have to take those two arrays into account. Not only does it calculate new data, it also has to calculate a new mask.
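As a small illustration of that double bookkeeping, the sketch below adds two masked arrays and checks that the result's mask is the element-wise OR of the input masks. The arrays a and b are made up for the example.

```python
import numpy as np

# Two small masked arrays with different masked positions.
a = np.ma.masked_array([1.0, 2.0, 3.0, 4.0], mask=[0, 0, 0, 1])
b = np.ma.masked_array([10.0, 20.0, 30.0, 40.0], mask=[1, 0, 0, 0])

total = a + b

# The result carries its own mask: a slot is masked if it was masked
# in either operand.
expected_mask = np.ma.getmaskarray(a) | np.ma.getmaskarray(b)

print(total)
print(np.ma.getmaskarray(total))
```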

It can take several approaches (depending on the operation):

use the data as is
use compressed data - a new array with the masked values removed
use filled data, where the masked values are replaced by the fill_value or some innocuous value (e.g. 0 when doing addition, 1 when doing multiplication).

The number of masked values, 0 or all, makes little, if any, difference in speed.
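The three approaches can be seen directly on a small array. This is only a sketch with a made-up array m; which approach a given operation uses internally depends on the operation.

```python
import numpy as np

m = np.ma.masked_array([1.0, 2.0, 3.0, 4.0], mask=[0, 0, 0, 1])

# 1. Data as is: the underlying array, masked slot included.
raw = m.data

# 2. Compressed data: a new 1-D array with the masked values removed.
compressed = m.compressed()

# 3. Filled data: masked slots replaced by an innocuous value,
#    0 for a sum and 1 for a product.
filled_for_sum = m.filled(0.0)
filled_for_prod = m.filled(1.0)

print(raw, compressed, filled_for_sum, filled_for_prod)
```

Summing or multiplying the filled arrays with plain ndarray methods gives the same values as m.sum() and m.prod() here.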

So the speed differences that you see are not surprising. There's a lot of extra calculation going on. The ma/core.py file says this package was first developed in pre-numpy days, and incorporated into numpy around 2005. While there have been changes to keep it up to date, I don't think it has been significantly reworked.

Here's the code for the MaskedArray.max method:

def max(self, axis=None, out=None, fill_value=None, keepdims=np._NoValue):

    kwargs = {} if keepdims is np._NoValue else {'keepdims': keepdims}

    _mask = self._mask
    newmask = _check_mask_axis(_mask, axis, **kwargs)
    if fill_value is None:
        fill_value = maximum_fill_value(self)
    # No explicit output
    if out is None:
        result = self.filled(fill_value).max(
            axis=axis, out=out, **kwargs).view(type(self))
        if result.ndim:
            # Set the mask
            result.__setmask__(newmask)
            # Get rid of Infs
            if newmask.ndim:
                np.copyto(result, result.fill_value, where=newmask)
        elif newmask:
            result = masked
        return result
    # Explicit output
    ....

The key steps are

fill_value = maximum_fill_value(self)  # depends on dtype
self.filled(fill_value).max(
            axis=axis, out=out, **kwargs).view(type(self))

You can experiment with filled to see what happens with your array.

In [40]: arr = np.arange(10.)                                                                                        
In [41]: arr                                                                                                         
Out[41]: array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
In [42]: Marr = np.ma.masked_array(arr, mask=[0]*9 + [1])                                                            
In [43]: Marr                                                                                                        
Out[43]: 
masked_array(data=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, --],
             mask=[False, False, False, False, False, False, False, False,
                   False,  True],
       fill_value=1e+20)
In [44]: np.ma.maximum_fill_value(Marr)                                                                              
Out[44]: -inf
In [45]: Marr.filled()                                                                                               
Out[45]: 
array([0.e+00, 1.e+00, 2.e+00, 3.e+00, 4.e+00, 5.e+00, 6.e+00, 7.e+00,
       8.e+00, 1.e+20])
In [46]: Marr.filled(_44)                                                                                            
Out[46]: array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8., -inf])
In [47]: arr.max()                                                                                                   
Out[47]: 9.0
In [48]: Marr.max()                                                                                                  
Out[48]: 8.0
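
If the per-call overhead matters, one possible workaround is to keep the data and the boolean mask as two plain arrays and do the fill step yourself. The sketch below is not part of numpy.ma: the function name fast_masked_max is hypothetical, it assumes a float dtype (so that -inf is a valid fill), and it returns -inf instead of masked where every entry is masked. Whether it is actually faster on your data is something to measure.

```python
import numpy as np


def fast_masked_max(data, mask, axis=None):
    """Hypothetical helper: max over the unmasked entries of a float array.

    Replaces masked slots with -inf (the same fill that
    np.ma.maximum_fill_value gives for floats) and calls the plain
    ndarray max. Fully masked slices come back as -inf.
    """
    return np.where(mask, -np.inf, data).max(axis=axis)


arr = np.arange(10.0)
mask = np.zeros(arr.shape, dtype=bool)
mask[-1] = True  # mask the last cell, as in the session above

result = fast_masked_max(arr, mask)
reference = np.ma.masked_array(arr, mask=mask).max()
print(result, reference)
```

The same idea extends to arithmetic: operate on the plain data arrays and combine the masks with a single logical OR at the end, rather than paying the masked-array bookkeeping on every intermediate step.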
