7

Say I have a Numpy vector,

A = zeros(100)

and I divide it into subvectors by a list of breakpoints which index into A, for instance,

breaks = linspace(0, 100, 11, dtype=int)

So the i-th subvector would be lie between the indices breaks[i] (inclusive) and breaks[i+1] (exclusive). The breaks are not necessarily equispaced, this is only an example. However, they will always be strictly increasing.

Now I want to operate on these subvectors. For instance, if I want to set all elements of the i-th subvector to i, I might do:

for i in range(len(breaks) - 1):
    A[breaks[i] : breaks[i+1]] = i

Or I might want to compute the subvector means:

b = empty(len(breaks) - 1)
for i in range(len(breaks) - 1):
    b = A[breaks[i] : breaks[i+1]].mean()

And so on.

How can I avoid using for loops and instead vectorize these operations?

4
  • Is breaks pre-sorted? Commented Apr 27, 2015 at 11:32
  • @Divakar: Yes, they are strictly increasing. Commented Apr 27, 2015 at 11:36
  • Also, would the limits of breaks cover the entire A, i.e. could there be some elements of A that won't be changed after this operation? Commented Apr 27, 2015 at 11:36
  • @Divakar: Yes, they will cover all of it. Commented Apr 27, 2015 at 11:37

3 Answers 3

7

You can use simple np.cumsum -

import numpy as np

# Form zeros array of same size as input array and 
# place ones at positions where intervals change
A1 = np.zeros_like(A)
A1[breaks[1:-1]] = 1

# Perform cumsum along it to create a staircase like array, as the final output
out = A1.cumsum()

Sample run -

In [115]: A
Out[115]: array([3, 8, 0, 4, 6, 4, 8, 0, 2, 7, 4, 9, 3, 7, 3, 8, 6, 7, 1, 6])

In [116]: breaks
Out[116]: array([ 0,  4,  9, 11, 18, 20])

In [142]: out
Out[142]: array([0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4]..)

If you want to have mean values of those subvectors from A, you can use np.bincount -

mean_vals = np.bincount(out, weights=A)/np.bincount(out)

If you are looking to extend this functionality and use a custom function instead, you might want to look into MATLAB's accumarray equivalent for Python/Numpy: numpy_groupies whose source code is available here.

Sign up to request clarification or add additional context in comments.

4 Comments

I like your method; it is faster than mine. You could also use A1 = np.zeros(breaks[-1]).
This solves the simple use case of setting each subvector to be constant (which was meant as an example). What if, for example, I want to compute the mean of each subvector?
Your first method is what np.unique with the optional return_inverse argument uses to do its thing. Nice!
@Jaime Yeah I guess its equivalent to that for sorted array. In MATLAB, we have a simliar thing with unique with its third output argument. It's nice to see all these similarities!
6

There really isn't a single answer to your question, but several techniques that you can use as building blocks. Another one you may find helpful:

All numpy ufuncs have a .reduceat method, which you can use to your advantage for some of your calculations:

>>> a = np.arange(100)
>>> breaks = np.linspace(0, 100, 11, dtype=np.intp)
>>> counts = np.diff(breaks)
>>> counts
array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
>>> sums = np.add.reduceat(a, breaks[:-1], dtype=np.float)
>>> sums
array([  45.,  145.,  245.,  345.,  445.,  545.,  645.,  745.,  845.,  945.])
>>> sums / counts  # i.e. the mean
array([  4.5,  14.5,  24.5,  34.5,  44.5,  54.5,  64.5,  74.5,  84.5,  94.5])

1 Comment

This was the closest to what I was looking for, but really I just want the slices used by reduceat--I don't actually want to reduce these slices! For that matter, I don't even have a ufunc. Is there something that would simply return the slices computed by reduceat and do so in a vectorized way? In my case all of the slices are the same length.
3

You could use np.repeat:

In [35]: np.repeat(np.arange(0, len(breaks)-1), np.diff(breaks))
Out[35]: 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9,
       9, 9, 9, 9, 9, 9, 9, 9])

To compute arbitrary binned statistics you could use scipy.stats.binned_statistic:

import numpy as np
import scipy.stats as stats

breaks = np.linspace(0, 100, 11, dtype=int)
A = np.random.random(100)

means, bin_edges, binnumber = stats.binned_statistic(
    x=np.arange(len(A)), values=A, statistic='mean', bins=breaks)

stats.binned_statistic can compute means, medians, counts, sums; or, to compute an arbitrary statistics for each bin, you can pass a callable to the statistic parameter:

def func(values):
    return values.mean()

funcmeans, bin_edges, binnumber = stats.binned_statistic(
    x=np.arange(len(A)), values=A, statistic=func, bins=breaks)

assert np.allclose(means, funcmeans)

1 Comment

But how will I now set the i-th part to i while avoiding a for loop?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.