subsetting 2d numpy array and keeping rows consistent

Question

I was wondering what the simplest method for doing the following is:

Suppose we have the following 2d arrays:

>>> a = np.array([['z', 'z', 'z', 'f', 'z','f', 'f'], ['z', 'z', 'z', 'f', 'z','f', 'f']])

array([['z', 'z', 'z', 'f', 'z', 'f', 'f'],
       ['z', 'z', 'z', 'f', 'z', 'f', 'f']],
      dtype='<U1')



>>> b = np.array(range(0,14)).reshape(2, -1)


array([[ 0,  1,  2,  3,  4,  5,  6],
       [ 7,  8,  9, 10, 11, 12, 13]])


>>> idxs = list(zip(*np.where(a == 'f')))

[(0, 3), (0, 5), (0, 6), (1, 3), (1, 5), (1, 6)]


>>> [b[x] for x in idxs]

[3, 5, 6, 10, 12, 13]

However, I would like to keep the original row structure, so that the results stay grouped by the first index, e.g.:

[[3, 5, 6], [7, 11]]

Is there a way to keep this structure easily?

That's a mix of length 3 and length 2 lists; it can't be a 2d array.
– hpaulj, Commented Aug 19, 2017 at 2:05
@hpaulj Yes, it would end up being a list of lists; it can't be a NumPy array at the end.
– chase, Commented Aug 19, 2017 at 2:08

MSeifert · Accepted Answer · 2017-08-19 02:53:11Z

This is a more complicated, but pure NumPy, solution:

1. Get the indices (in a flattened version of a) where the value is 'f'.
2. Get the flat indices at which each new row begins.
3. Find the positions in the array from step 1 at which each new row starts.
4. Split the selected values of b at these positions.

The code would look like this:

>>> indices = np.flatnonzero(a.ravel() == 'f')
>>> rows = np.arange(1, a.shape[0])*a.shape[1]
>>> np.split(b.ravel()[indices], np.searchsorted(indices, rows))
[array([3, 5, 6], dtype=int64), array([10, 12, 13], dtype=int64)]

It is a bit longer than the other solutions, and I'm not sure whether it is faster.
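To make the steps easier to follow, here is a sketch that prints nothing until the end but names each intermediate value. The ragged input `a` is made up for illustration (it is not the array from the question), and `row_starts` and `split_points` are names I introduced:

```python
import numpy as np

# Illustrative input: the rows have a different number of 'f' entries.
a = np.array([['z', 'z', 'z', 'f', 'z', 'f', 'f'],
              ['f', 'z', 'z', 'z', 'f', 'z', 'z']])
b = np.arange(14).reshape(2, -1)

# Step 1: flat indices of every 'f' in a.
indices = np.flatnonzero(a.ravel() == 'f')           # [3, 5, 6, 7, 11]

# Step 2: flat index at which each row after the first one begins.
row_starts = np.arange(1, a.shape[0]) * a.shape[1]   # [7]

# Step 3: position inside `indices` where each of those rows starts.
split_points = np.searchsorted(indices, row_starts)  # [3]

# Step 4: pick the matching values of b and cut them at those positions.
parts = np.split(b.ravel()[indices], split_points)
result = [p.tolist() for p in parts]
print(result)  # [[3, 5, 6], [7, 11]]
```

Because `indices` is sorted, `np.searchsorted` can locate the row boundaries without a Python-level loop.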

Although, personally, I would go with a list comprehension and a zip:

[b_row[a_row] for a_row, b_row in zip(a == 'f', b)]

It's much shorter and according to my timings quite performant.
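As a small sketch of how this behaves on uneven input (the 3x3 arrays below are made up for illustration), a row without any match simply yields an empty result, and `.tolist()` turns each piece into a plain list:

```python
import numpy as np

# Illustrative data: row 1 contains no 'f' at all.
a = np.array([['z', 'f', 'z'],
              ['z', 'z', 'z'],
              ['f', 'f', 'z']])
b = np.arange(9).reshape(3, 3)

# Compare once for the whole array, then index row by row.
mask = a == 'f'
rows = [b_row[m_row].tolist() for m_row, b_row in zip(mask, b)]
print(rows)  # [[1], [], [6, 7]]
```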

Timing:

import numpy as np
a = np.array([['z', 'z', 'z', 'f', 'z','f', 'f']]*10000)
b = np.arange(a.size).reshape(-1, a.shape[1])

%%timeit

indices = np.flatnonzero(a.ravel() == 'f')
rows = np.arange(1, a.shape[0])*a.shape[1]
np.split(b.ravel()[indices], np.searchsorted(indices, rows))

123 ms ± 8.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit [b[i][a[i] == 'f'] for i in range(len(a))]

162 ms ± 14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

But both are a lot slower than the zip-based variant I suggested in the comments on Psidom's answer:

%timeit [b_row[a_row] for a_row, b_row in zip(a == 'f', b)]

44.9 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
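Before comparing timings it may be worth confirming that the three variants actually agree. The following is only a sanity check on a smaller version of the benchmark data; the names `via_split`, `via_index` and `via_zip` are mine:

```python
import numpy as np

# Smaller version of the benchmark input used above.
a = np.array([['z', 'z', 'z', 'f', 'z', 'f', 'f']] * 1000)
b = np.arange(a.size).reshape(-1, a.shape[1])

# Variant 1: flatten, search for the row boundaries, split.
indices = np.flatnonzero(a.ravel() == 'f')
rows = np.arange(1, a.shape[0]) * a.shape[1]
via_split = np.split(b.ravel()[indices], np.searchsorted(indices, rows))

# Variant 2: index each row, comparing inside the loop.
via_index = [b[i][a[i] == 'f'] for i in range(len(a))]

# Variant 3: compare once, then zip over mask rows and value rows.
via_zip = [b_row[a_row] for a_row, b_row in zip(a == 'f', b)]

# All three should produce the same per-row arrays.
same = all(
    np.array_equal(s, i) and np.array_equal(i, z)
    for s, i, z in zip(via_split, via_index, via_zip)
)
print(len(via_split), len(via_index), len(via_zip), same)
```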

akuiper · Accepted Answer · 2017-08-19 02:12:22Z


Use a list comprehension that loops over the rows:

[b[i][a[i] == 'f'] for i in range(len(a))]
# [array([3, 5, 6]), array([10, 12, 13])]


2 Comments

MSeifert Over a year ago

or with zip: [b_row[a_row == 'f'] for a_row, b_row in zip(a, b)]. You could even go a step further and do the comparison outside of the loop: [b_row[a_row] for a_row, b_row in zip(a == 'f', b)] (that could be a bit faster).

akuiper Over a year ago

@MSeifert Nice thought on the second option. I can see a speed up there.

Will · Accepted Answer · 2017-08-19 02:16:19Z


a = np.array([['z', 'z', 'z', 'f', 'z','f', 'f'], ['z', 'z', 'z', 'f', 'z','f', 'f']])

b = np.array(range(0,14)).reshape(2, -1)

idxs = list(zip(*np.where(a == 'f')))


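# One empty list per row of a, so this also works for more than two rows.
c = [[] for _ in range(a.shape[0])]
for x in idxs:
    # x is a (row, column) tuple; group the value by its row.
    c[x[0]].append(b[x])

print(c)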


hpaulj · Accepted Answer · 2017-08-19 02:36:49Z

In [89]: idx = np.where(a == 'f')
In [90]: idx
Out[90]: 
(array([0, 0, 0, 1, 1, 1], dtype=int32),
 array([3, 5, 6, 3, 5, 6], dtype=int32))

We can apply the where tuple to select items in b:

In [93]: b[idx]
Out[93]: array([ 3,  5,  6, 10, 12, 13])

Equivalently apply the boolean mask:

In [94]: b[a == 'f']
Out[94]: array([ 3,  5,  6, 10, 12, 13])

np.argwhere takes the transpose of where, producing a 2d array like your idxs.

In [95]: np.argwhere(a == 'f')
Out[95]: 
array([[0, 3],
       [0, 5],
       [0, 6],
       [1, 3],
       [1, 5],
       [1, 6]], dtype=int32)
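A quick way to convince yourself of that relationship (a minimal sketch using the arrays from the question; `pairs` and `as_tuples` are names I made up):

```python
import numpy as np

a = np.array([['z', 'z', 'z', 'f', 'z', 'f', 'f'],
              ['z', 'z', 'z', 'f', 'z', 'f', 'f']])
mask = a == 'f'

# argwhere gives one (row, column) pair per True entry ...
pairs = np.argwhere(mask)

# ... which is the transpose of the tuple returned by where.
assert np.array_equal(pairs, np.transpose(np.where(mask)))

# Converting to tuples reproduces the idxs list from the question.
as_tuples = [tuple(p) for p in pairs.tolist()]
print(as_tuples)  # [(0, 3), (0, 5), (0, 6), (1, 3), (1, 5), (1, 6)]
```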

As noted in "Delete all elements in an array corresponding to Boolean mask", we can't, in general, select elements with a mask and retain some sort of 2d structure. In selected cases we can reshape the 1d result into something meaningful.

In [96]: b[idx].reshape(2,-1)
Out[96]: 
array([[ 3,  5,  6],
       [10, 12, 13]])
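The reshape only works when every row selects the same number of items. One way to guard against that is sketched below; `select_rows` is a hypothetical helper of mine (not a NumPy function), it assumes at least one row, and the ragged input is made up for illustration:

```python
import numpy as np

def select_rows(values, mask):
    """Hypothetical helper: masked selection that keeps the row grouping.

    Returns a 2d array when every row selects the same number of items,
    otherwise a list of 1d arrays. Assumes mask has at least one row.
    """
    counts = mask.sum(axis=1)  # number of selected items per row
    if (counts == counts[0]).all():
        return values[mask].reshape(len(mask), counts[0])
    return [row[m] for row, m in zip(values, mask)]

b = np.arange(14).reshape(2, -1)
a_even = np.array([list('zzzfzff'), list('zzzfzff')])    # 3 matches per row
a_ragged = np.array([list('zzzfzff'), list('fzzzfzz')])  # 3 and 2 matches

even = select_rows(b, a_even == 'f')
ragged = select_rows(b, a_ragged == 'f')
print(even.tolist())                # [[3, 5, 6], [10, 12, 13]]
print([r.tolist() for r in ragged]) # [[3, 5, 6], [7, 11]]
```

Returning two different types from one function is a trade-off; if the caller always expects a list, dropping the reshape branch may be simpler.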

An easy way to collect these values on a row by row basis, and allowing for differing size results in each row, would be to iterate:

In [100]: [j[i=='f'] for i,j in zip(a,b)]
Out[100]: [array([3, 5, 6]), array([10, 12, 13])]
In [101]: [j[i=='f'].tolist() for i,j in zip(a,b)]
Out[101]: [[3, 5, 6], [10, 12, 13]]
