4

I have a sorted integer array, e.g., [0, 0, 1, 1, 1, 2, 4, 4], and I would like to determine where the integer blocks start and how long the blocks are. The block sizes are small but the array itself can be very large, so efficiency is important. The total number of blocks is also known.

numpy.unique does the trick:

import numpy


a = numpy.array([0, 0, 1, 1, 1, 2, 4, 4])
num_blocks = 4
print(a)

_, idx_start, count = numpy.unique(a, return_index=True, return_counts=True)

print(idx_start)
print(count)
[0 0 1 1 1 2 4 4]
[0 2 5 6]
[2 3 1 2]

but is slow. I would assume that, given the specific structure of the input array, there's a more efficient solution.

For example, something as simple as

import numpy

a = numpy.array([0, 0, 1, 1, 1, 2, 3, 3])
num_blocks = 4


k = 0
z = a[k]
block_idx = 0
counts = numpy.empty(num_blocks, dtype=int)
count = 0
while k < len(a):
    if z == a[k]:
        count += 1
    else:
        z = a[k]
        counts[block_idx] = count
        count = 1
        block_idx += 1
    k += 1
counts[block_idx] = count

print(counts)

gives the block sizes, and a simple numpy.cumsum would give index_start. Using a Python loop is slow of course.

Any hints?

5
  • 2
    Doesn't seem overkill to me. That's the point of a library Commented May 17, 2018 at 13:37
  • This looks more like a solution than a question. What is your question? Commented May 17, 2018 at 13:39
  • Thanks for the comments. I clarified the question. Commented May 17, 2018 at 13:42
  • I doubt we can do better than numpy Commented May 17, 2018 at 13:45
  • 1
    I think OP has a point that given the information that the list is sorted, perhaps there is a way we can do it faster Commented May 17, 2018 at 13:47

2 Answers 2

4

Here's one with some masking and slicing -

def grp_start_len(a):
    m = np.r_[True,a[:-1] != a[1:],True] #np.concatenate for a bit more boost
    idx = np.flatnonzero(m)
    return idx[:-1], np.diff(idx)

Sample run -

In [18]: a
Out[18]: array([0, 0, 1, 1, 1, 2, 4, 4])

In [19]: grp_start_len(a)
Out[19]: (array([0, 2, 5, 6]), array([2, 3, 1, 2]))

Timings (setup from @AGN Gazer's solution) -

In [24]: np.random.seed(0)

In [25]: a = np.sort(np.random.randint(1, 10000, 10000))

In [26]: %timeit _, idx_start, count = np.unique(a, return_index=True, return_counts=True)
1000 loops, best of 3: 411 µs per loop

# @AGN Gazer's solution
In [27]: %timeit st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]; idx = st[:-1]; cnt = np.ediff1d(st)
10000 loops, best of 3: 81.2 µs per loop

In [28]: %timeit grp_start_len(a)
10000 loops, best of 3: 60.1 µs per loop

Bumping up the sizes 10x more -

In [40]: np.random.seed(0)

In [41]: a = np.sort(np.random.randint(1, 100000, 100000))

In [42]: %timeit _, idx_start, count = np.unique(a, return_index=True, return_counts=True)
    ...: %timeit st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]; idx = st[:-1]; cnt = np.ediff1d(st)
    ...: %timeit grp_start_len(a)
100 loops, best of 3: 5.34 ms per loop
1000 loops, best of 3: 792 µs per loop
1000 loops, best of 3: 463 µs per loop
Sign up to request clarification or add additional context in comments.

Comments

3
np.where(np.ediff1d(a, None, a[0]))[0]

If you want to have the first "0" as in your answer, add a non-zero number to a[0]:

np.where(np.ediff1d(a, None, a[0] + 1))[0]

EDIT (Block length):

Ah, just noticed that you also want to get block length. Then, modify the above code:

st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]
idx = st[:-1]
cnt = np.ediff1d(st)

Then,

>>> print(idx)
[0 2 5 6]
>>> print(cnt)
[2 3 1 2]

EDIT 2 (Timing tests)

In [69]: a = np.sort(np.random.randint(1, 10000, 10000))

In [70]: %timeit _, idx_start, count = np.unique(a, return_index=True, return_counts=True)
240 µs ± 7.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [71]: %timeit st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]; idx = st[:-1]; cnt = np.ediff1d(st)
74.3 µs ± 816 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.