Efficiently determine indices of middle 0's for consecutive 0's in an array

Question

Given an array consisting of non-negative numbers, I want to be able to find the index of the middle 0 for each of the consecutive 0's in the array, excluding 0's that are found at the start or the end of the array. A simple example: we have given array

np.array([0, 0, 0, 1, 2, 0, 0, 0, 3, 2, 0, 0, 0, 0, 0, 0, 5, 6, 0, 0, 0])

Then, I want to return indices 6 and 13 (for an even amount of consecutive 0's, I want either the floor or ceil).

Now, my current implementation is as follows:

def middle_zero(array):
    busy = False
    not_start_zero = False

    middle_zero_list = []

    for i in range(len(array)):
        if array[i] > 0:
            not_start_zero = True

        if array[i] == 0 and not_start_zero and not busy:
            start = i
            busy = True
        if busy and array[i] > 0:
            end = i
            middle_zero_list.append(int(np.mean([start, end])))
            busy = False

    return np.array(middle_zero_list)

middle_zero(np.array([0, 0, 0, 1, 2, 0, 0, 0, 3, 2, 0, 0, 0, 0, 0, 0, 5, 6, 0, 0, 0]))
>> array([6, 13])

Although it is working, I think that a much more clever solution is possible here. As I have to do this computation many times, I want it to be more efficient.

Seems like a great usecase for numba.njit. Have you tried that? — dsm
– dsm, Commented Jul 12, 2022 at 16:12
@MichaelSzczesny I forgot to mention that I only want to consider consecutive zeros that are not at the start and not at the end of the array. When those are left out of consideration, is there still a bug? — TheEreStack
– TheEreStack, Commented Jul 12, 2022 at 16:27
does it deal with negative numbers as required? eg middle_zero(np.array([0, 1, 2, 0, 0, 0, -3, -2, 0, 0, 0, 0, 0, 0, 5, 6, 0, 0])) gives array([8]) so it's not the middle of zeros, it's the middle of non positives — Nin17
– Nin17, Commented Jul 12, 2022 at 16:35
Good point @Nin17. I did not mention that I only consider arrays consisting of non-negative numbers. — TheEreStack
– TheEreStack, Commented Jul 12, 2022 at 16:46

Nin17 · Accepted Answer · 2022-07-12 18:27:22Z

3

Note this doesn't include solitary non-positives, ie middle_zero(np.array([1,0,2])) gives array([], dtype=float64)

If you want a numpy only solution, you could do this (which shows significant speed improvements if the input array is large though it's slower than @DominikStańczak answer using numba (actually it depends on input - see below) - you can also use numba with this though it doesn't speed it up much):

import numpy as np
def middle_zero(array):
    argwhere = np.argwhere(array > 0).ravel()

    diff = np.ediff1d(argwhere)
    result = argwhere[:-1]+diff//2 # Or argwhere[:-1]+np.ceil(diff/2) for consistency with OP

    return result[result != argwhere[:-1]] #Or result[result-1 != argwhere[:-1]] for consistency with OP

Output with given example:

array([ 6, 12])

Edit timings: with input as np.array([0, 0, 0, 1, 2, 0, 0, 0, 3, 2, 0, 0, 0, 0, 0, 0, 5, 6, 0, 0, 0])

25.1 µs ± 227 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) #OP
11.8 µs ± 30.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) #Nin17
1.18 µs ± 3.97 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) #Dominik

With input as np.random.default_rng().integers(0, 2, size=100000)

225 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) #OP
833 µs ± 3.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) #Nin17
481 µs ± 2.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) #Dominik

If you have long regions of consecutive non-positive numbers, this is actually the fastest, with an input of np.array([1,*[0]*1000,1]*1000):

539 ms ± 1.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) #OP
432 µs ± 1.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) #Nin17
1.93 ms ± 2.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) #Dominik

edited Jul 12, 2022 at 18:27

answered Jul 12, 2022 at 17:26

Nin17

3,6022 gold badges7 silver badges18 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

dsm Over a year ago

Awesome, nice to see a numpy solution, and that it scales better than numba, as expected! Can I ask why did you go with ediff1d vs diff here?

Nin17 Over a year ago

I used ediff1d instead of diff because diff isn't supported by numba so you can also njit this to make it slightly faster. I think it works the same with diff without numba.

dsm · Accepted Answer · 2022-07-12 16:17:24Z

2

Simply wrapping your function in numba.njit (replacing the np.mean with a direct computation) gets this a nice speedup:

[nav] In [8]: @numba.njit
         ...: def middle_zero2(array):
         ...:     busy = False
         ...:     not_start_zero = False
         ...:
         ...:     middle_zero_list = []
         ...:
         ...:     for i in range(len(array)):
         ...:         if array[i] > 0:
         ...:             not_start_zero = True
         ...:
         ...:         if array[i] == 0 and not_start_zero and not busy:
         ...:             start = i
         ...:             busy = True
         ...:         if busy and array[i] > 0:
         ...:             end = i
         ...:             middle_zero_list.append(int((start+end)/2))
         ...:             busy = False
         ...:
         ...:     return np.array(middle_zero_list)

[ins] In [10]: %timeit middle_zero(a)
24.3 µs ± 466 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# skipped a cache run

[ins] In [12]: %timeit middle_zero2(a)
763 ns ± 20.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

about a factor of 30 of a speedup on your small example test case.

answered Jul 12, 2022 at 16:17

dsm

2,2631 gold badge19 silver badges29 bronze badges

1 Comment

TheEreStack Over a year ago

Thanks for your help Dominik! Although this already results in a speedup, I think that there might be a more clever solution possible (e.g. leaving out the for loop). Do you also have any suggestions on that?

Colim · Accepted Answer · 2022-07-12 17:19:42Z

0

I don't know if this is faster, but should be metered. It should be faster on a compiled language, but python is not.

def middle_zero(array):
    busy = False
    not_start_zero = False
    start=None
    end=None

    middle_zero_list = []

    for i, value in enumerate(array):
        isZero = value == 0
        not_start_zero = [True, not_start_zero][isZero]

        start, busy = [[start, busy], [i, True]][isZero and not_start_zero and not busy]
        if busy and not isZero:
            end = i
            middle_zero_list.append(int(np.mean([start, end])))
            busy = False

    return np.array(middle_zero_list)

answered Jul 12, 2022 at 17:19

Colim

1,3135 silver badges11 bronze badges

1 Comment

Nin17 Over a year ago

This is slower than OPs

Collectives™ on Stack Overflow

Efficiently determine indices of middle 0's for consecutive 0's in an array

3 Answers 3

2 Comments

1 Comment

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related