
How can this next for-loop get a speedup with numpy? I guess some fancy indexing trick can be used here, but I have no idea which one (can einsum be used here?).

a=0
for i in range(len(b)):
    a+=numpy.mean(C[d,e,f+b[i]])*g[i]

edit: C is a numpy 3D array of shape comparable to (20, 1600, 500). d, e, f are indices of points that are "interesting" (d, e, f all have the same length, around 900); b and g have the same length (around 50). The mean is taken over all the points in C with the indices d, e, f + b[i].
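For reference, here is a self-contained setup matching the shapes described above (random data, purely illustrative) that makes the loop runnable as posted:

```python
import numpy

# Random stand-ins with the shapes from the question
C = numpy.random.rand(20, 1600, 500)          # the 3D data array
d = numpy.random.randint(0, 20, size=900)     # indices of "interesting" points
e = numpy.random.randint(0, 1600, size=900)
f = numpy.random.randint(0, 400, size=900)    # headroom so f + b stays in bounds
b = numpy.random.randint(0, 100, size=50)     # offsets applied to f
g = numpy.random.rand(50)                     # weights

a = 0
for i in range(len(b)):
    a += numpy.mean(C[d, e, f + b[i]]) * g[i]
```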

4 Comments

  • It would help to have a MCVE. What is C, what are d, e, f, g and b? What do you actually want to compute? You sum over means... why? Please give a bit of context, so that it becomes possible to see what optimisation is possible. Commented Oct 16, 2014 at 12:06
  • @Unapiedra updated a bit Commented Oct 16, 2014 at 12:11
  • Side note: you can use enumerate in your for loop to get indexes: for i, item in enumerate(b): Commented Oct 16, 2014 at 14:11
  • @DiZou Ah yeah, I forgot about that - thanks Commented Oct 16, 2014 at 14:27

4 Answers

3

You can do the following trick:

C[d, e][:, np.add.outer(f, b)].dot(g).diagonal().mean()

This can be improved further by taking, up front, only the elements that will end up on the diagonal:

C[d, e][np.arange(len(d))[:, None], np.add.outer(f, b)].dot(g).mean()
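As a sanity check, here is a minimal sketch (random data with the question's shapes) verifying that the second one-liner matches the loop. It works because the mean over points and the weighted sum over offsets commute by linearity: row i of `C[d, e]` is `C[d[i], e[i], :]`, and `np.add.outer(f, b)` picks columns `f[i] + b` out of each such row.

```python
import numpy as np

# Random stand-ins with the shapes from the question
C = np.random.rand(20, 1600, 500)
d = np.random.randint(0, 20, size=900)
e = np.random.randint(0, 1600, size=900)
f = np.random.randint(0, 400, size=900)
b = np.random.randint(0, 100, size=50)
g = np.random.rand(50)

# Loop version for reference
loop = sum(np.mean(C[d, e, f + b[i]]) * g[i] for i in range(len(b)))

rows = C[d, e]                    # shape (900, 500): row i is C[d[i], e[i], :]
cols = np.add.outer(f, b)         # shape (900, 50): entry (i, j) is f[i] + b[j]
# Index row i at columns f[i] + b, weight by g, average over points
vec = rows[np.arange(len(d))[:, None], cols].dot(g).mean()

assert np.isclose(loop, vec)
```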

7 Comments

Your second version is two times slower than my version, but I noticed in the added info that I swapped two values - so maybe that's the cause (I've now re-updated with the right values). Your type of solution is the kind of thing I am looking for, though.
@usethedeathstar And what about the first version? Does it perform faster? I had the impression that the second version would scale better, but I didn't test it.
The first version is about 300 times slower.
@usethedeathstar Which NumPy version are you using? I am on 1.9.0, and the second solution is faster than yours, because they improved fancy-indexing performance very significantly...
Does anyone know how much the speed difference is for this construction between 1.8.0 and 1.9.0?
1

Timings

Both sessions were initialized with

In [1]: C = np.random.rand(20,1600,500)

In [2]: d = np.random.randint(0, 20, size=900)

In [3]: e = np.random.randint(1600, size=900)

In [4]: f = np.random.randint(400, size=900)

In [5]: b = np.random.randint(100, size=50)

In [6]: g = np.random.rand(50)

Numpy 1.9.0

In [7]: %timeit C[d,e,f + b[:,np.newaxis]].mean(axis=1).dot(g)
1000 loops, best of 3: 942 µs per loop

In [8]: %timeit C[d[:,np.newaxis],e[:, np.newaxis],f[:, np.newaxis] + b].mean(axis=0).dot(g)
1000 loops, best of 3: 762 µs per loop

In [9]: %%timeit                                               
   ...: a = 0
   ...: for i in range(len(b)):                                     
   ...:     a += np.mean(C[d, e, f + b[i]]) * g[i]
   ...: 
100 loops, best of 3: 2.25 ms per loop

In [10]: np.__version__
Out[10]: '1.9.0'

In [11]: %%timeit
(C.ravel()[np.ravel_multi_index((d[:,np.newaxis],
                                 e[:,np.newaxis],
                                 f[:,np.newaxis] + b), dims=C.shape)]
 .mean(axis=0).dot(g))
   ....: 
1000 loops, best of 3: 940 µs per loop

Numpy 1.8.2

In [7]: %timeit C[d,e,f + b[:,np.newaxis]].mean(axis=1).dot(g)
100 loops, best of 3: 2.81 ms per loop

In [8]: %timeit C[d[:,np.newaxis],e[:, np.newaxis],f[:, np.newaxis] + b].mean(axis=0).dot(g)
100 loops, best of 3: 2.7 ms per loop

In [9]: %%timeit                                               
   ...: a = 0
   ...: for i in range(len(b)):                                     
   ...:     a += np.mean(C[d, e, f + b[i]]) * g[i]
   ...: 
100 loops, best of 3: 4.12 ms per loop

In [10]: np.__version__
Out[10]: '1.8.2'

In [51]: %%timeit
(C.ravel()[np.ravel_multi_index((d[:,np.newaxis],
                                 e[:,np.newaxis],
                                 f[:,np.newaxis] + b), dims=C.shape)]
 .mean(axis=0).dot(g))
   ....: 
1000 loops, best of 3: 1.4 ms per loop

Description

You can use a coordinate-broadcasting trick to build the full 50x900 array from the beginning:

In [158]: C[d,e,f + b[:, np.newaxis]].shape
Out[158]: (50, 900)

From that point, mean and dot will get you to the destination:

In [159]: C[d,e,f + b[:, np.newaxis]].mean(axis=1).dot(g)
Out[159]: 13.582349962518611

In [160]: a = 0
   .....: for i in range(len(b)):
   .....:     a += np.mean(C[d, e, f + b[i]]) * g[i]
   .....: print(a)
   .....:
13.5823499625

And it's about 3.3x faster than the loop version:

In [161]: %timeit C[d,e,f + b[:, np.newaxis]].mean(axis=1).dot(g)
1000 loops, best of 3: 585 µs per loop

In [162]: %%timeit                                               
a = 0
for i in range(len(b)):                                     
    a += np.mean(C[d, e, f + b[i]]) * g[i]
   .....: 
1000 loops, best of 3: 1.95 ms per loop

The array is of significant size, so you must factor in the CPU cache. I cannot say I know how np.sum traverses the array, but in 2D arrays there is always a slightly better way (when the next element you pick is adjacent in memory) and a slightly worse way (when the next element is found across the stride). Let's see if we can win something more by transposing the array during indexing:

In [196]: C[d[:,np.newaxis], e[:,np.newaxis], f[:,np.newaxis] + b].mean(axis=0).dot(g)
Out[196]: 13.582349962518608

In [197]: %timeit C[d[:,np.newaxis], e[:,np.newaxis], f[:,np.newaxis] + b].mean(axis=0).dot(g)
1000 loops, best of 3: 461 µs per loop

That's 4.2x faster than the loop.
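The ravelled-index variant benchmarked above computes flat offsets into `C.ravel()` with `np.ravel_multi_index`, which sidesteps the slower multi-dimensional fancy-indexing path on older NumPy. A minimal sketch with the same assumed random setup:

```python
import numpy as np

# Random stand-ins with the shapes from the question
C = np.random.rand(20, 1600, 500)
d = np.random.randint(0, 20, size=900)
e = np.random.randint(0, 1600, size=900)
f = np.random.randint(0, 400, size=900)
b = np.random.randint(0, 100, size=50)
g = np.random.rand(50)

# Convert the (d, e, f + b) index triples into flat offsets into C.ravel();
# the broadcast result has shape (900, 50), one column per offset in b.
flat = np.ravel_multi_index(
    (d[:, np.newaxis], e[:, np.newaxis], f[:, np.newaxis] + b),
    dims=C.shape)
result = C.ravel()[flat].mean(axis=0).dot(g)

# Identical to the direct multi-dimensional fancy indexing
direct = C[d[:, np.newaxis], e[:, np.newaxis], f[:, np.newaxis] + b].mean(axis=0).dot(g)
assert np.isclose(result, direct)
```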

5 Comments

Is your solution using NumPy 1.9.0, or is it faster on 1.8.0 as well?
Interesting, np1.8 is indeed significantly slower than 1.9. I've updated the answer with 1.8.2 timings.
Curiouser, there's a "1.9ms" constant slowdown for each of the three compared variants.
It appears there was a significant pessimization of multi-dimensional integer lookups; the same lookup performed with ravelled indexers is significantly faster on numpy-1.8 (see the updated benchmarks).
Changed this to my accepted answer, since it shows great benchmarking of several solutions, for both 1.9 and 1.8.
1

It's quite similar to the loopy version:

np.sum(np.mean(C[d,e,f+b[:,None]], axis=1) * g)

And you can combine the summation and multiplication into a dot product:

C[d,e,f+b[:,None]].mean(1).dot(g)

But for the timing that doesn't seem to matter; the indexing operation is by far the most time-consuming part (at least on NumPy 1.8.0). Compared to that, the loop overhead in the original code is insignificant.
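A quick sketch (random data with the question's shapes, purely illustrative) confirming that the two formulations agree:

```python
import numpy as np

# Random stand-ins with the shapes from the question
C = np.random.rand(20, 1600, 500)
d = np.random.randint(0, 20, size=900)
e = np.random.randint(0, 1600, size=900)
f = np.random.randint(0, 400, size=900)
b = np.random.randint(0, 100, size=50)
g = np.random.rand(50)

# Explicit sum of weighted means, mirroring the loop
explicit = np.sum(np.mean(C[d, e, f + b[:, None]], axis=1) * g)
# Same computation with the summation and multiplication fused into a dot product
fused = C[d, e, f + b[:, None]].mean(1).dot(g)

assert np.isclose(explicit, fused)
```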

Comments

0

The only speedup you could hope for, in structural terms, would be with the following code:

# Initialize a 2-D array, one column per offset in b.
# (d, e, f are paired point indices, so C[d, e, f + b[i]] is 1-D of length len(d).)
aggregated = numpy.zeros((len(d), len(b)))
# Populate it with the shifted copies of C
for i in range(len(b)):
    aggregated[:, i] = C[d, e, f + b[i]]

# Compute the mean over the points axis
means = numpy.mean(aggregated, axis=0)
# Multiply term-by-term by g (means and g have the same length) and sum
a = numpy.sum(means * g)

However, this does not guarantee that the calculation will be faster; it may even be slower, for the following reasons:

  • Populating the intermediate array has a non-negligible cost, since it copies memory
  • b is very small, so you won't win much anyway. If b is bigger this can become interesting, as long as d, e, f get smaller too

In any case you should benchmark both solutions. You could also try using something like Cython to perform the for loop, but that seems like overkill.
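A minimal benchmark harness along those lines (plain `timeit`, random data with the question's shapes; the helper names are made up for illustration):

```python
import timeit
import numpy

# Random stand-ins with the shapes from the question
C = numpy.random.rand(20, 1600, 500)
d = numpy.random.randint(0, 20, size=900)
e = numpy.random.randint(0, 1600, size=900)
f = numpy.random.randint(0, 400, size=900)
b = numpy.random.randint(0, 100, size=50)
g = numpy.random.rand(50)

def looped():
    # The original loop over the offsets in b
    a = 0
    for i in range(len(b)):
        a += numpy.mean(C[d, e, f + b[i]]) * g[i]
    return a

def vectorized():
    # Broadcast all offsets at once, then mean over points and dot with g
    return C[d, e, f + b[:, None]].mean(axis=1).dot(g)

for fn in (looped, vectorized):
    t = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print("%s: %.2f ms" % (fn.__name__, t * 1e3))
```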

Comments
