Efficient numpy subarrays extraction from a mask

Question

I am looking for a Pythonic way to extract multiple subarrays from a given array using a mask, as shown in this example:

a = np.array([10, 5, 3, 2, 1])
m = np.array([True, True, False, True, True])

The output should be a collection of arrays like the following, where each contiguous "region" of True values (True values next to each other) in the mask m gives the indices of one subarray.

L[0] = np.array([10, 5])
L[1] = np.array([2, 1])

Another approach is to use scipy.ndimage.measurements.label, as suggested in stackoverflow.com/questions/9440921/…
– TommasoF, Commented Apr 13, 2017 at 7:44

Divakar · Accepted Answer · 2017-04-13 07:44:36Z

3

Here's one approach -

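def separate_regions(a, m):
    # Pad the mask with False on both ends so every run of True has a start and an end
    m0 = np.concatenate(( [False], m, [False] ))
    # Positions where the padded mask flips; they alternate start, stop, start, stop, ...
    idx = np.flatnonzero(m0[1:] != m0[:-1])
    # Slice a between each (start, stop) pair
    return [a[idx[i]:idx[i+1]] for i in range(0,len(idx),2)]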

Sample run -

In [41]: a = np.array([10, 5, 3, 2, 1])
    ...: m = np.array([True, True, False, True, True])
    ...: 

In [42]: separate_regions(a, m)
Out[42]: [array([10,  5]), array([2, 1])]
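
To see why this works, it can help to look at the intermediate arrays. The following is a minimal sketch that repeats the steps of separate_regions on the sample input; the names changes, starts and stops are introduced here for illustration and are not part of the answer.

```python
import numpy as np

a = np.array([10, 5, 3, 2, 1])
m = np.array([True, True, False, True, True])

# Pad with False so that runs touching either end still produce a flip.
m0 = np.concatenate(([False], m, [False]))

# True wherever neighbouring entries of the padded mask differ.
changes = m0[1:] != m0[:-1]

# Flip positions alternate: start of a run, end of that run (exclusive), ...
idx = np.flatnonzero(changes)
starts, stops = idx[::2], idx[1::2]

print(idx)     # [0 2 3 5]
print(starts)  # [0 3]
print(stops)   # [2 5]

regions = [a[s:e] for s, e in zip(starts, stops)]
print(regions)  # [array([10,  5]), array([2, 1])]
```

Because the padded mask starts and ends with False, the number of flips is always even, so the pairing never runs off the end of idx.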

Runtime test

Other approach(es) -

# @kazemakase's soln
def zip_split(a, m):
    d = np.diff(m)
    cuts = np.flatnonzero(d) + 1

    asplit = np.split(a, cuts)
    msplit = np.split(m, cuts)

    L = [aseg for aseg, mseg in zip(asplit, msplit) if np.all(mseg)]
    return L

Timings -

In [49]: a = np.random.randint(0,9,(100000))

In [50]: m = np.random.rand(100000)>0.2

# @kazemakase's solution
In [51]: %timeit zip_split(a,m)
10 loops, best of 3: 114 ms per loop

# @Daniel Forsman's solution
In [52]: %timeit splitByBool(a,m)
10 loops, best of 3: 25.1 ms per loop

# Proposed in this post
In [53]: %timeit separate_regions(a, m)
100 loops, best of 3: 5.01 ms per loop

Increasing the average length of the islands (runs of True) by raising the share of True values in the mask from about 80% to about 90% -

In [58]: a = np.random.randint(0,9,(100000))

In [59]: m = np.random.rand(100000)>0.1

In [60]: %timeit zip_split(a,m)
10 loops, best of 3: 64.3 ms per loop

In [61]: %timeit splitByBool(a,m)
100 loops, best of 3: 14 ms per loop

In [62]: %timeit separate_regions(a, m)
100 loops, best of 3: 2.85 ms per loop
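
The timings above use IPython's %timeit. Outside IPython, a comparable check could be sketched with the standard timeit module as below. This is only a sketch under assumptions: the array size, seed and repeat count are arbitrary choices made here, zip_split and splitByBool are condensed restatements of the other answers, and the numbers printed will differ from those quoted above. The sketch also verifies that the three functions return the same regions before timing them.

```python
import timeit
import numpy as np

def separate_regions(a, m):
    m0 = np.concatenate(([False], m, [False]))
    idx = np.flatnonzero(m0[1:] != m0[:-1])
    return [a[idx[i]:idx[i + 1]] for i in range(0, len(idx), 2)]

def zip_split(a, m):
    cuts = np.flatnonzero(np.diff(m)) + 1
    asplit = np.split(a, cuts)
    msplit = np.split(m, cuts)
    return [aseg for aseg, mseg in zip(asplit, msplit) if np.all(mseg)]

def splitByBool(a, m):
    pieces = np.split(a, np.flatnonzero(np.diff(m)) + 1)
    return pieces[::2] if m[0] else pieces[1::2]

def same_regions(x, y):
    # Helper introduced for this sketch: compare two lists of arrays.
    return len(x) == len(y) and all(np.array_equal(p, q) for p, q in zip(x, y))

rng = np.random.default_rng(0)          # seed chosen arbitrarily
a = rng.integers(0, 9, size=20000)      # smaller than the 100000 used above
m = rng.random(20000) > 0.2

reference = separate_regions(a, m)
all_agree = all(same_regions(reference, f(a, m)) for f in (zip_split, splitByBool))
print("all agree:", all_agree)

for f in (zip_split, splitByBool, separate_regions):
    seconds = timeit.timeit(lambda: f(a, m), number=5) / 5
    print(f"{f.__name__}: {seconds * 1e3:.2f} ms per call")
```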

edited Apr 13, 2017 at 7:44

answered Apr 13, 2017 at 7:37

Divakar


2 Comments

TommasoF Over a year ago

I accepted this answer because it compares the other methods discussed and also provides a faster one. Thank you!

Mad Physicist Over a year ago

Fun fact: I just found out that np.r_[False, m, False] is 5-10x slower than np.concatenate(([False], m, [False])).
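
That claim can be checked on your own machine with something like the sketch below. The mask size and repeat count are assumptions chosen here, and the ratio will depend on the NumPy version and the array length, so no expected timing is shown.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)       # arbitrary seed
m = rng.random(100000) > 0.2         # mask size is an assumption

padded_r = np.r_[False, m, False]
padded_cat = np.concatenate(([False], m, [False]))

# Both spellings should build the same padded mask.
same = np.array_equal(padded_r, padded_cat)
print("same result:", same)

t_r = timeit.timeit(lambda: np.r_[False, m, False], number=200)
t_cat = timeit.timeit(lambda: np.concatenate(([False], m, [False])), number=200)
print(f"np.r_: {t_r:.4f} s, np.concatenate: {t_cat:.4f} s for 200 calls")
```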

Daniel F · Answer · 2017-04-13 07:33:07Z

2

def splitByBool(a, m):
    if m[0]:
        return np.split(a, np.nonzero(np.diff(m))[0] + 1)[::2]
    else:
        return np.split(a, np.nonzero(np.diff(m))[0] + 1)[1::2]

This returns a list of arrays, one for each contiguous run of True in m.
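
The function relies on np.split producing pieces that alternate between masked-in and masked-out runs, so m[0] decides whether the even or the odd pieces are kept. Below is a small sketch of that, using the arrays from the question plus a second mask, m2, made up here to show the case where the mask starts with False. The function body is a condensed restatement of the one above.

```python
import numpy as np

def splitByBool(a, m):
    pieces = np.split(a, np.flatnonzero(np.diff(m)) + 1)
    # Pieces alternate between True runs and False runs, starting with m[0].
    return pieces[::2] if m[0] else pieces[1::2]

a = np.array([10, 5, 3, 2, 1])
m = np.array([True, True, False, True, True])
m2 = np.array([False, True, True, False, True])  # hypothetical second mask

cuts = np.flatnonzero(np.diff(m)) + 1
print(cuts)                 # [2 3]
print(np.split(a, cuts))    # [array([10,  5]), array([3]), array([2, 1])]

print(splitByBool(a, m))    # [array([10,  5]), array([2, 1])]
print(splitByBool(a, m2))   # [array([5, 3]), array([1])]
```

Because the function reads m[0], it assumes a non-empty mask; an empty m raises an IndexError.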

answered Apr 13, 2017 at 7:33

Daniel F


4 Comments

MB-F Over a year ago

Nice solution. Makes use of the fact that True and False segments are necessarily alternating.

Mad Physicist Over a year ago

I like this solution because it can be turned into a one-liner: np.split(a, np.nonzero(np.diff(m))[0] + 1)[1 - m[0]::2]

Daniel F Over a year ago

Or even np.split(a, np.flatnonzero(np.diff(m)) + 1)[1 - m[0]::2], which is a bit more readable

Mad Physicist Over a year ago

Interesting, each one liner is slower than the last :) But easier to read, as you said.

MB-F · Answer · 2017-04-13 07:34:02Z

1

Sounds like a natural application for np.split.

You first have to figure out where to cut the array, which is wherever the mask changes between True and False. Then discard every segment where the mask is False.

import numpy as np

a = np.array([10, 5, 3, 2, 1])
m = np.array([True, True, False, True, True])

d = np.diff(m)
cuts = np.flatnonzero(d) + 1

asplit = np.split(a, cuts)
msplit = np.split(m, cuts)

L = [aseg for aseg, mseg in zip(asplit, msplit) if np.all(mseg)]

print(L[0])  # [10  5]
print(L[1])  # [2 1]

answered Apr 13, 2017 at 7:34

MB-F

