
I have a huge matrix of values and I want to distribute them on a grid and compute the mean of each box of the grid. At the moment I loop over all the values, but I am looking for a vectorized way to do this to reduce execution time.

import numpy as np    

values = np.arange(0,1000)

ind_x = (values//10)%3
ind_y = values%3

box_sum = np.zeros((3,3))
box_nb = np.zeros((3,3))

for v in range(0,len(values)):
    box_sum[ind_x[v],ind_y[v]] += values[v] 
    box_nb[ind_x[v],ind_y[v]] += 1

box_mean = np.divide(box_sum,box_nb)

In this example ind_x and ind_y are built arithmetically, but in the real application they may be random values. Any ideas?

2 Answers


You can use np.bincount, like so -

ids = ind_x*3 + ind_y # Generate 1D linear index IDs for use with bincount

box_sum = np.bincount(ids,values,minlength=9).reshape(3,3)
box_nb = np.bincount(ids,minlength=9).reshape(3,3)
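
The per-box mean is then the elementwise ratio of the two. One caveat worth adding here (a small sketch on top of the answer's code, not part of the original): empty boxes have a count of zero, so guarding the division keeps NumPy from warning about 0/0 -

with np.errstate(invalid='ignore', divide='ignore'):
    box_mean = box_sum / box_nb # empty boxes come out as nan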

Sample run -

1) Set up the inputs and run the original code:

In [59]: # Let's use random numbers for variety, as the OP states:
         # "..  in the application it may be random values"
    ...: values = np.random.randint(0,1000,(1000))
    ...: 
    ...: # Rest of the code same as the one posted within the question
    ...: ind_x = (values//10)%3
    ...: ind_y = values%3
    ...: 
    ...: box_sum = np.zeros((3,3))
    ...: box_nb = np.zeros((3,3))
    ...: 
    ...: for v in range(0,len(values)):
    ...:     box_sum[ind_x[v],ind_y[v]] += values[v] 
    ...:     box_nb[ind_x[v],ind_y[v]] += 1
    ...:     

In [60]: box_sum
Out[60]: 
array([[ 64875.,  50268.,  50496.],
       [ 48759.,  61661.,  53575.],
       [ 53076.,  48529.,  76576.]])

In [61]: box_nb
Out[61]: 
array([[ 125.,  105.,   96.],
       [  97.,  116.,  116.],
       [  96.,  100.,  149.]])

2) Use the proposed approach and verify the results:

In [62]: ids = ind_x*3 + ind_y

In [63]: np.bincount(ids,values,minlength=9).reshape(3,3)
Out[63]: 
array([[ 64875.,  50268.,  50496.],
       [ 48759.,  61661.,  53575.],
       [ 53076.,  48529.,  76576.]])

In [64]: np.bincount(ids,minlength=9).reshape(3,3)
Out[64]: 
array([[125, 105,  96],
       [ 97, 116, 116],
       [ 96, 100, 149]])
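
To gauge the speed-up on your own machine, here is a minimal timing sketch (the wrapper functions are mine, added for illustration; they are not from the original post) -

import numpy as np
from timeit import timeit

values = np.random.randint(0,1000,(1000))
ind_x = (values//10)%3
ind_y = values%3

def loop_version():
    box_sum = np.zeros((3,3))
    box_nb = np.zeros((3,3))
    for v in range(len(values)):
        box_sum[ind_x[v],ind_y[v]] += values[v]
        box_nb[ind_x[v],ind_y[v]] += 1
    return box_sum, box_nb

def bincount_version():
    ids = ind_x*3 + ind_y
    box_sum = np.bincount(ids,values,minlength=9).reshape(3,3)
    box_nb = np.bincount(ids,minlength=9).reshape(3,3)
    return box_sum, box_nb

print(timeit(loop_version, number=100))
print(timeit(bincount_version, number=100))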

2 Comments

It's perfect! The time saving is real. Thanks a lot, Divakar.
@Vince np.bincount is one of the fastest tools in NumPy! So, not surprised at all :)

The numpy_indexed package (disclaimer: I am its author) can be used to solve such problems in an efficient manner:

import numpy_indexed as npi
(unique_x, unique_y), mean = npi.group_by((ind_x, ind_y)).mean(values)

I suspect the bincount solution is faster for a relatively dense grid, because this approach operates on a sparse grid: what you get back is a tuple of index arrays marking the boxes where a mean was computed, plus a matching array of means. But that can be a big advantage if your grid is in fact quite sparse (as you say, the indices may be 'random', or at least not as structured in practice).
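
If you do need the result on a dense grid, the sparse output can be scattered back in, assuming unique_x and unique_y are integer bin indices (a minimal sketch, not part of the package API):

import numpy as np

box_mean = np.full((3,3), np.nan) # boxes that received no values stay nan
box_mean[unique_x, unique_y] = mean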

Also, this is more flexible; group_by allows you to compute a variety of statistics, for keys of various dtypes and value arrays of higher dimensions.
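
For comparison, the same sparse grouping can also be sketched in plain NumPy via np.unique and np.bincount (a minimal sketch assuming the combined key is a small integer; the variable names are mine):

import numpy as np

keys = ind_x*3 + ind_y # one combined 1D key per value
unique_keys, inverse = np.unique(keys, return_inverse=True)
group_means = np.bincount(inverse, values) / np.bincount(inverse) # mean per occupied box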

2 Comments

Thanks for the answer, but I need a robust program that I can hand to my colleagues easily.
Is pip install numpy-indexed not easy enough? :) I would say the test suite could still use some work, but I'm using numpy-indexed in various production cases myself, so I'm pretty confident about its robustness.
