
Background

I have a 1D NumPy array initialized with zeros.

import numpy as np
section = np.zeros(12000)  # long enough to cover the largest index used below

Then I have a Pandas DataFrame where I have indices in two columns:

import pandas as pd

d = {'start': {0: 7200, 1: 7500, 2: 7560, 3: 8100, 4: 11400},
     'end': {0: 10800, 1: 8100, 2: 8100, 3: 8150, 4: 12000}}

df = pd.DataFrame(data=d, columns=['start', 'end'])

For each pair of indices, I want to set the values at the corresponding positions in the NumPy array to True.

My current solution

I can do this by applying a function to the DataFrame:

def fill_array(row):
    # Slice assignment marks the half-open range [start, end)
    section[row.start:row.end] = True

df.apply(fill_array, axis=1)
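Note in passing that section is a float array, so True is stored as 1.0; if an actual boolean mask is the goal, the array could be initialized with a bool dtype instead:

# A boolean array stores True directly instead of 1.0
section = np.zeros(12000, dtype=bool)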

I want to vectorize this operation

This works as I expect, but for the fun of it, I would like to vectorize the operation. I'm not very proficient with this, and my searching online has not put me on the right track.

I would really appreciate any suggestions on how to make this into a vector operation, if at all possible.

2 Comments

  • How many start, end pairs would there be in your actual use case?
  • @Divakar Worst case 10,000 pairs, and a NumPy array of 1-3 million indices.

2 Answers


The trick in the implementations that follow is to put 1 at every start index and -1 at every end index of a zero-initialized int array. The real trick comes next: cumulatively summing that array gives non-zero values exactly at the positions covered by the (start, end) pairs. The final step is then to look for non-zeros to get the output as a boolean array. Thus, we have two vectorized solutions, with their implementations shown below -

def filled_array(start, end, length):
    # Scatter +1 at every start and -1 at every end; the cumulative
    # sum is then positive exactly inside the covered ranges
    out = np.zeros(length, dtype=int)
    np.add.at(out, start, 1)
    np.add.at(out, end, -1)
    return out.cumsum() > 0

def filled_array_v2(start, end, length):  # Using @DanielF's suggestion
    # bincount builds the +1/-1 histogram in one call per column
    out = np.bincount(start, minlength=length) - np.bincount(end, minlength=length)
    return out.cumsum().astype(bool)

Sample run -

In [2]: start
Out[2]: array([ 4,  7,  5, 15])

In [3]: end
Out[3]: array([12, 12,  7, 17])

In [4]: out = filled_array(start, end, length=20)

In [7]: pd.DataFrame(out) # print as dataframe for easy verification
Out[7]: 
        0
0   False
1   False
2   False
3   False
4    True
5    True
6    True
7    True
8    True
9    True
10   True
11   True
12  False
13  False
14  False
15   True
16   True
17  False
18  False
19  False
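Applied to the DataFrame from the question, a call could look like the sketch below; note (as pointed out in the comments) that length must exceed the largest end index, or np.add.at will go out of bounds:

# Sketch: fill from the question's df; length must be > df['end'].max()
length = df['end'].max() + 1
section = filled_array(df['start'].values, df['end'].values, length)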

3 Comments

minor quibble: maxlen should probably be minlen, since if your zeros array is too short add.at will fail. Also, for performance you could do out = np.bincount(start, minlength=minlen) - np.bincount(end, minlength=minlen) and return out.cumsum().astype(bool), since integers != 0 resolve to True without the comparison step
@DanielF Good points, thanks a lot! Edited. That name maxlen wasn't the right one. I'll leave it to the OP to make sure the length given as the input arg covers all indices.
Thanks @Divakar! This is a great answer. I like the algorithm you suggested. Clever! I measured it at 245 µs, and with the suggestions from @DanielF it came down to only 150 µs per loop.

Vectorization

You have already done the most important vectorization by using slice assignment, but you cannot fully vectorize this with slices, since Python does not support "multiple slices".

If you really want a vectorized form, you can create an array with the "True" indices, like this

# np.r_ concatenates the per-row ranges into one flat index array
indices = np.r_[tuple(slice(row.start, row.end) for row in df.itertuples())]
section[indices] = True

But this will most likely be slower, since it creates a new temporary array with indices.
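If you want to measure this yourself, a rough check with the standard timeit module might look like the following sketch (assuming df and section are defined as in the question):

import timeit

def indexed_fill():
    # the np.r_-based "vectorized" variant
    indices = np.r_[tuple(slice(row.start, row.end) for row in df.itertuples())]
    section[indices] = True

def sliced_fill():
    # the original per-row slice assignment
    for row in df.itertuples():
        section[row.start:row.end] = True

for fn in (indexed_fill, sliced_fill):
    print(fn.__name__, timeit.timeit(fn, number=1000))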

Removing duplicate work

With that said, you could gain some speed-up by avoiding duplicate work. Specifically, you can take the union of the ranges, giving you a set of disjoint intervals.

In your case, the first interval overlaps all of the others except the last one, so your DataFrame is equivalent to

d= {'start': {0: 7200, 1: 11400},
    'end': {0: 10800, 1: 12000}}

This cuts the number of slice assignments from five to two, a 60% reduction! But first we need to find these merged intervals, which a standard interval-merging pass can do:

# Merge overlapping (or touching) half-open [start, end) intervals
slices = [(row.start, row.end) for row in df.itertuples()]
slices_union = []
for start, end in sorted(slices):
    if slices_union and slices_union[-1][1] >= start:
        slices_union[-1][1] = max(slices_union[-1][1], end)
    else:
        slices_union.append([start, end])

Then you can use this (hopefully much shorter) list of slices like this:

for start, end in slices_union:
    section[start:end] = True
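As a sanity check (a sketch, not part of the original answer; it reuses df and the 12000-element sizing from the question), you can verify that the merged slices mark exactly the same positions as the per-row loop:

# Fill one array per approach and compare
expected = np.zeros(12000, dtype=bool)
for row in df.itertuples():
    expected[row.start:row.end] = True

merged = np.zeros(12000, dtype=bool)
for start, end in slices_union:
    merged[start:end] = True

assert np.array_equal(expected, merged)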

5 Comments

As it turns out, we can vectorize, just need to change the way we do it :)
Well, he can, but I doubt it's worth it. I gave a solution at the end using np.r_, which I guess is the easiest solution.
Apologies for trying to be correct about terminology, but a list/dataframe comprehension isn't a vectorized solution, especially when put in big headers :)
Thank you for your suggestion, @JonasAdler. I had to make a small change to your code to get it to work: the original threw an AttributeError: 'str' object has no attribute 'start'. It was easily resolved by using df.itertuples() instead of df. As you suspected, the vectorized version was a little slower than the iterative one, at 747 µs versus 649 µs with iteration. For comparison, my original function clocks in at 413 µs.
Updated according to your comment. Did you try the other trick?
