Determine blocks in sorted numpy integer array

Question

I have a sorted integer array, e.g., [0, 0, 1, 1, 1, 2, 4, 4], and I would like to determine where the integer blocks start and how long the blocks are. The block sizes are small but the array itself can be very large, so efficiency is important. The total number of blocks is also known.

numpy.unique does the trick:

import numpy


a = numpy.array([0, 0, 1, 1, 1, 2, 4, 4])
num_blocks = 4
print(a)

_, idx_start, count = numpy.unique(a, return_index=True, return_counts=True)

print(idx_start)
print(count)

[0 0 1 1 1 2 4 4]
[0 2 5 6]
[2 3 1 2]

but is slow. I would assume that, given the specific structure of the input array, there's a more efficient solution.

For example, something as simple as

import numpy

a = numpy.array([0, 0, 1, 1, 1, 2, 3, 3])
num_blocks = 4


k = 0
z = a[k]
block_idx = 0
counts = numpy.empty(num_blocks, dtype=int)
count = 0
while k < len(a):
    if z == a[k]:
        count += 1
    else:
        z = a[k]
        counts[block_idx] = count
        count = 1
        block_idx += 1
    k += 1
counts[block_idx] = count

print(counts)

gives the block sizes, and a simple numpy.cumsum would give index_start. Using a Python loop is slow of course.

Any hints?

This looks more like a solution than a question. What is your question? — jpp
– jpp, Commented May 17, 2018 at 13:39
I think OP has a point that given the information that the list is sorted, perhaps there is a way we can do it faster — sshashank124
– sshashank124, Commented May 17, 2018 at 13:47

Divakar · Accepted Answer · 2018-05-17 15:07:06Z

Here's one with some masking and slicing -

def grp_start_len(a):
    m = np.r_[True,a[:-1] != a[1:],True] #np.concatenate for a bit more boost
    idx = np.flatnonzero(m)
    return idx[:-1], np.diff(idx)

Sample run -

In [18]: a
Out[18]: array([0, 0, 1, 1, 1, 2, 4, 4])

In [19]: grp_start_len(a)
Out[19]: (array([0, 2, 5, 6]), array([2, 3, 1, 2]))

Timings (setup from @AGN Gazer's solution) -

In [24]: np.random.seed(0)

In [25]: a = np.sort(np.random.randint(1, 10000, 10000))

In [26]: %timeit _, idx_start, count = np.unique(a, return_index=True, return_counts=True)
1000 loops, best of 3: 411 µs per loop

# @AGN Gazer's solution
In [27]: %timeit st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]; idx = st[:-1]; cnt = np.ediff1d(st)
10000 loops, best of 3: 81.2 µs per loop

In [28]: %timeit grp_start_len(a)
10000 loops, best of 3: 60.1 µs per loop

Bumping up the sizes 10x more -

In [40]: np.random.seed(0)

In [41]: a = np.sort(np.random.randint(1, 100000, 100000))

In [42]: %timeit _, idx_start, count = np.unique(a, return_index=True, return_counts=True)
    ...: %timeit st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]; idx = st[:-1]; cnt = np.ediff1d(st)
    ...: %timeit grp_start_len(a)
100 loops, best of 3: 5.34 ms per loop
1000 loops, best of 3: 792 µs per loop
1000 loops, best of 3: 463 µs per loop

AGN Gazer · Accepted Answer · 2018-05-17 14:33:54Z

np.where(np.ediff1d(a, None, a[0]))[0]

If you want to have the first "0" as in your answer, add a non-zero number to a[0]:

np.where(np.ediff1d(a, None, a[0] + 1))[0]

EDIT (Block length):

Ah, just noticed that you also want to get block length. Then, modify the above code:

st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]
idx = st[:-1]
cnt = np.ediff1d(st)

Then,

>>> print(idx)
[0 2 5 6]
>>> print(cnt)
[2 3 1 2]

EDIT 2 (Timing tests)

In [69]: a = np.sort(np.random.randint(1, 10000, 10000))

In [70]: %timeit _, idx_start, count = np.unique(a, return_index=True, return_counts=True)
240 µs ± 7.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [71]: %timeit st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]; idx = st[:-1]; cnt = np.ediff1d(st)
74.3 µs ± 816 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Collectives™ on Stack Overflow

Determine blocks in sorted numpy integer array

2 Answers 2

Comments

EDIT (Block length):

EDIT 2 (Timing tests)

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

EDIT (Block length):

EDIT 2 (Timing tests)

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related