
I have a numpy array of integers.

I have two other arrays holding the start and length (or, equivalently, the start and end) indices into this array; they identify variable-length sequences of integers that I need to process.

import numpy

x = numpy.array([2, 3, 5, 7, 9, 12, 15, 21, 27, 101, 250])  # Can have length of millions

starts = numpy.array([2, 7])  # Can have lengths of thousands
ends = numpy.array([5, 9])

# required output is x[2:5], x[7:9] in a flat 1D array
# [5, 7, 9, 21, 27]

I can do this easily with for loops, but the application is performance-sensitive, so I'm looking for a way to do it without Python iteration.
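For reference, a minimal sketch of the loop version I'd like to replace (it just slices and concatenates, using the arrays defined above):

out = numpy.concatenate([x[i:j] for i, j in zip(starts, ends)])
# array([ 5,  7,  9, 21, 27])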

Any help will be gratefully received!

Doug

  • Could there be overlaps? As in x[2:5], x[3:9], etc.? If so, would the overlaps be included as many times as they occur? Commented Dec 8, 2019 at 17:41
  • stackoverflow.com/questions/12589923/… Commented Dec 8, 2019 at 17:44
  • No overlaps are possible Commented Dec 8, 2019 at 19:21
  • @scotsman60 So, would the expected output be x[2:5] + x[3:9] or x[2:5] + x[5:9]? Commented Dec 9, 2019 at 7:21

1 Answer


Approach #1

One vectorized approach is to build a boolean mask off broadcasted comparisons -

In [16]: r = np.arange(len(x))

In [18]: x[((r>=starts[:,None]) & (r<ends[:,None])).any(0)]
Out[18]: array([ 5,  7,  9, 21, 27])
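As a rough reusable wrapper (a sketch of the same idea; the function name is made up), keeping in mind that the broadcasted comparison materializes a len(starts)-by-len(x) boolean array, so memory grows with both inputs:

import numpy as np

def take_ranges_mask(x, starts, ends):
    r = np.arange(len(x))
    # (len(starts), len(x)) comparison, collapsed with any() along axis 0
    mask = ((r >= starts[:, None]) & (r < ends[:, None])).any(0)
    return x[mask]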

Approach #2

Another vectorized way is to create ramps of 1s and 0s with cumsum (this should scale better with many start-end pairs), like so -

idx = np.zeros(len(x), dtype=int)
idx[starts] = 1                     # +1 where a range begins
idx[ends[ends<len(x)]] = -1         # -1 where a range ends (skip ends past the array)
out = x[idx.cumsum().astype(bool)]  # running sum is 1 inside ranges, 0 outside
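One caveat worth flagging (a sketch of a possible tweak, not part of the snippet above): if one range ends exactly where the next one starts, the two plain assignments collide at the shared index and the -1 overwrites the +1, breaking the ramp. Accumulating with np.add.at, and padding idx by one slot, avoids both that and the ends<len(x) filter:

idx = np.zeros(len(x) + 1, dtype=int)   # extra slot so an end equal to len(x) is fine
np.add.at(idx, starts, 1)               # +1 at every range start
np.add.at(idx, ends, -1)                # -1 at every range end (accumulates rather than overwrites)
out = x[idx[:-1].cumsum().astype(bool)]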

Approach #3

Another one, loop-based for memory efficiency; it could be better with many entries in the starts, ends pairs -

mask = np.zeros(len(x), dtype=bool)
for (i,j) in zip(starts, ends):
    mask[i:j] = True   # mark each range in a single 1D boolean mask
out = x[mask]
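To put the memory trade-off in rough numbers (hypothetical sizes chosen to match the question's "millions" and "thousands"): the broadcasted mask in approach #1 needs one byte per (range, element) pair, while this 1D mask needs one byte per element.

n_x, n_ranges = 10_000_000, 5_000                          # hypothetical sizes
print(n_x * n_ranges / 1e9, "GB for the broadcast mask")   # 50.0 GB
print(n_x / 1e6, "MB for the 1D mask")                     # 10.0 MB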

Approach #4

For completeness, here's another one that loops to select the slices and assigns them into a pre-allocated array; it should be good when selecting slices off a large array -

lens = ends - starts
out = np.empty(lens.sum(), dtype=x.dtype)  # pre-allocate the flat output
start = 0
for (i,j,l) in zip(starts, ends, lens):
    out[start:start+l] = x[i:j]            # copy each slice into its slot
    start += l

If there are a lot of iterations, a minor optimization can reduce the compute per iteration -

lens = ends - starts
lims = np.r_[0, lens].cumsum()             # output offsets for each slice
out = np.empty(lims[-1], dtype=x.dtype)
for (i,j,s,t) in zip(starts, ends, lims[:-1], lims[1:]):
    out[s:t] = x[i:j]
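For the question's sample arrays, the precomputed limits work out like this (a quick check of the snippet above):

lens = ends - starts            # array([3, 2])
lims = np.r_[0, lens].cumsum()  # array([0, 3, 5])
# out[0:3] <- x[2:5] and out[3:5] <- x[7:9], giving [ 5  7  9 21 27]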

1 Comment

Thanks @Divakar!!! The first solution is exactly what I was looking for. I can conditionally switch to one of the others if I run short of memory - but if I run short of memory on this operation, I have bigger problems elsewhere in my application....
