
Background

I have a 1D NumPy array initialized with zeros.

import numpy as np
section = np.zeros(12000)  # long enough to cover the largest index used below

Then I have a Pandas DataFrame where I have indices in two columns:

import pandas as pd

d = {'start': {0: 7200, 1: 7500, 2: 7560, 3: 8100, 4: 11400},
     'end': {0: 10800, 1: 8100, 2: 8100, 3: 8150, 4: 12000}}

df = pd.DataFrame(data=d, columns=['start', 'end'])

For each pair of indices, I want to set the values at the corresponding positions in the NumPy array to True.

My current solution

I can do this by applying a function to the DataFrame:

def fill_array(row):
    # Slice assignment marks the half-open range [start, end)
    section[row.start:row.end] = True

df.apply(fill_array, axis=1)
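Note in passing that section is a float array, so True is stored as 1.0; if an actual boolean mask is the goal, the array could be initialized with a bool dtype instead:

# A boolean array stores True directly instead of 1.0
section = np.zeros(12000, dtype=bool)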

I want to vectorize this operation

This works as I expect, but for the fun of it, I would like to vectorize the operation. I'm not very proficient with this, and my searching online has not put me on the right track.

I would really appreciate any suggestions on how to make this into a vector operation, if at all possible.

2 Comments

  • How many start, end pairs would there be in your actual use case?
  • @Divakar Worst case 10,000 pairs, and a NumPy array of 1-3 million indices.

2 Answers


The trick in the implementations that follow is to put 1 at every start index and -1 at every end index of a zero-initialized int array. The real trick comes next: cumulatively summing that array gives non-zero values exactly at the positions covered by the (start, end) pairs. The final step is then to look for non-zeros to get the output as a boolean array. Thus, we have two vectorized solutions, with their implementations shown below -

def filled_array(start, end, length):
    # Scatter +1 at every start and -1 at every end; the cumulative
    # sum is then positive exactly inside the covered ranges
    out = np.zeros(length, dtype=int)
    np.add.at(out, start, 1)
    np.add.at(out, end, -1)
    return out.cumsum() > 0

def filled_array_v2(start, end, length):  # Using @DanielF's suggestion
    # bincount builds the +1/-1 histogram in one call per column
    out = np.bincount(start, minlength=length) - np.bincount(end, minlength=length)
    return out.cumsum().astype(bool)

Sample run -

In [2]: start
Out[2]: array([ 4,  7,  5, 15])

In [3]: end
Out[3]: array([12, 12,  7, 17])

In [4]: out = filled_array(start, end, length=20)

In [7]: pd.DataFrame(out) # print as dataframe for easy verification
Out[7]: 
        0
0   False
1   False
2   False
3   False
4    True
5    True
6    True
7    True
8    True
9    True
10   True
11   True
12  False
13  False
14  False
15   True
16   True
17  False
18  False
19  False
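Applied to the DataFrame from the question, a call could look like the sketch below; note (as pointed out in the comments) that length must exceed the largest end index, or np.add.at will go out of bounds:

# Sketch: fill from the question's df; length must be > df['end'].max()
length = df['end'].max() + 1
section = filled_array(df['start'].values, df['end'].values, length)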

3 Comments

minor quibble: maxlen should probably be minlen, since if your zeros array is too short add.at will fail. Also, for performance you could do out = np.bincount(start, minlength=minlen) - np.bincount(end, minlength=minlen) and return out.cumsum().astype(bool), since integers != 0 resolve to True without the comparison step
@DanielF Good points, thanks a lot! Edited. That name maxlen wasn't the right one. I'll leave it to the OP to make sure the length given as the input arg covers all indices.
Thanks @Divakar! This is a great answer. I like the algorithm you suggested. Clever! I measured it at 245 µs, and with the suggestions from @DanielF it came down to only 150 µs per loop.

Vectorization

You have already done the most important vectorization by using slice assignment, but you cannot fully vectorize this with slices, since Python does not support "multiple slices".

If you really want a vectorized form, you can create an array with the "True" indices, like this

# np.r_ concatenates the per-row ranges into one flat index array
indices = np.r_[tuple(slice(row.start, row.end) for row in df.itertuples())]
section[indices] = True

But this will most likely be slower, since it creates a new temporary array with indices.
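If you want to measure this yourself, a rough check with the standard timeit module might look like the following sketch (assuming df and section are defined as in the question):

import timeit

def indexed_fill():
    # the np.r_-based "vectorized" variant
    indices = np.r_[tuple(slice(row.start, row.end) for row in df.itertuples())]
    section[indices] = True

def sliced_fill():
    # the original per-row slice assignment
    for row in df.itertuples():
        section[row.start:row.end] = True

for fn in (indexed_fill, sliced_fill):
    print(fn.__name__, timeit.timeit(fn, number=1000))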

Removing duplicate work

With that said, you could gain some speed-up by avoiding duplicate work. Specifically, you can take the union of the ranges, giving you a set of disjoint intervals.

In your case, the first interval overlaps all of the others except the last one, so your DataFrame is equivalent to

d= {'start': {0: 7200, 1: 11400},
    'end': {0: 10800, 1: 12000}}

This cuts the number of slice assignments from five to two, a 60% reduction! But first we need to find these merged intervals, which a standard interval-merging pass can do:

# Merge overlapping (or touching) half-open [start, end) intervals
slices = [(row.start, row.end) for row in df.itertuples()]
slices_union = []
for start, end in sorted(slices):
    if slices_union and slices_union[-1][1] >= start:
        slices_union[-1][1] = max(slices_union[-1][1], end)
    else:
        slices_union.append([start, end])

Then you can use this (hopefully much shorter) list of slices like this:

for start, end in slices_union:
    section[start:end] = True
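As a sanity check (a sketch, not part of the original answer; it reuses df and the 12000-element sizing from the question), you can verify that the merged slices mark exactly the same positions as the per-row loop:

# Fill one array per approach and compare
expected = np.zeros(12000, dtype=bool)
for row in df.itertuples():
    expected[row.start:row.end] = True

merged = np.zeros(12000, dtype=bool)
for start, end in slices_union:
    merged[start:end] = True

assert np.array_equal(expected, merged)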

5 Comments

As it turns out, we can vectorize, just need to change the way we do it :)
Well, he can, but I doubt it's worth it. I gave a solution at the end using np.r_, which I guess is the easiest solution.
Apologies for trying to be correct about terminology, but a list/dataframe comprehension isn't a vectorized solution, especially when put in big headers :)
Thank you for your suggestion, @JonasAdler. I had to make a small change to your code to get it to work: the original threw an AttributeError: 'str' object has no attribute 'start'. It was easily resolved by using df.itertuples() instead of df. As you suspected, the vectorized version was a little slower than the iterative one, at 747 µs versus 649 µs with iteration. For comparison, my original function clocks in at 413 µs.
Updated according to your comment. Did you try the other trick?
