
The "vectorized" fancy indexing in NumPy sometimes gives unexpected results. For example:

import numpy
a = numpy.zeros((1000, 4), dtype='uint32')
b = numpy.zeros((1000, 4), dtype='uint32')
i = numpy.random.randint(0, 1000, 1000)  # random row indices
j = numpy.random.randint(0, 4, 1000)     # random column indices

a[i, j] += 1
for k in range(1000):
    b[i[k], j[k]] += 1

This gives different results in the arrays 'a' and 'b': each (i, j) pair that appears contributes at most 1 to 'a', no matter how many times it repeats, whereas every repeat is counted in 'b'. This is easily verified as follows:

>>> numpy.sum(a)
883
>>> numpy.sum(b)
1000

It is also notable that the fancy indexing version is almost two orders of magnitude faster than the for loop. My question is: "Is there an efficient way for numpy to compute the repeat counts as implemented using the for loop in the provided example?"
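To make the buffered semantics concrete, here is a minimal, deterministic sketch (not from the original post) showing why a repeated index is counted only once:

```python
import numpy as np

a = np.zeros(3, dtype='uint32')
i = np.array([0, 0, 1])  # index 0 is repeated

# a[i] += 1 expands to a[i] = a[i] + 1: all reads happen before
# any write, so the repeated index 0 is incremented only once.
a[i] += 1
print(a)  # [1 1 0]

# The explicit loop applies each increment in sequence.
b = np.zeros(3, dtype='uint32')
for k in i:
    b[k] += 1
print(b)  # [2 1 0]
```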

  • The semantics are different: the for loop works in sequence (values at repeated indices are incremented multiple times), but a += 1 is equivalent to a = a + 1 (repeated indices are incremented only once). Commented Jun 12, 2012 at 17:26
  • I know there is a difference in semantics (I should not have said 'unexpected' in my original post), but is there an efficient way to compute the for-loop semantics (i.e. count occurrences of (i, j) tuples) using NumPy? Commented Jun 12, 2012 at 17:38
  • A lazy option is to use Cython. It might be both faster and more readable than a numpy function-based version. The latter, though, forces you to think about the operation in more general terms (which might be a good thing), and the resulting solution could be more extendable, e.g., stackoverflow.com/questions/4962606/… Commented Jun 12, 2012 at 18:35
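As an aside (not part of the original thread): newer NumPy versions (1.8+) provide np.add.at, an unbuffered in-place add that reproduces the for-loop counts directly without a Python loop; a minimal sketch:

```python
import numpy as np

a = np.zeros((1000, 4), dtype='uint32')
rng = np.random.default_rng(0)
i = rng.integers(0, 1000, size=1000)
j = rng.integers(0, 4, size=1000)

# np.add.at is unbuffered: repeated (i, j) pairs are each applied.
np.add.at(a, (i, j), 1)
print(a.sum())  # 1000
```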

1 Answer


This should do what you want:

import numpy as np

np.bincount(np.ravel_multi_index((i, j), (1000, 4)), minlength=4000).reshape(1000, 4)

As a breakdown: ravel_multi_index converts the index pairs given by i and j into integer indices into the C-flattened array; bincount counts how many times each value in 0..3999 appears in that list of indices; and reshape converts the flat counts back into a 2-D array with the shape of a.
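A quick self-contained check (my addition, with a fixed seed) that this one-liner matches the for-loop counts:

```python
import numpy as np

rng = np.random.default_rng(1)
i = rng.integers(0, 1000, size=1000)
j = rng.integers(0, 4, size=1000)

counts = np.bincount(np.ravel_multi_index((i, j), (1000, 4)),
                     minlength=4000).reshape(1000, 4)

# Reference: the explicit for loop from the question.
b = np.zeros((1000, 4), dtype='int64')
for k in range(1000):
    b[i[k], j[k]] += 1

print((counts == b).all())  # True
```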

In terms of performance, I measure it at 200 times faster than "b", and 5 times faster than "a"; your mileage may vary.

Since you need to add the counts into an existing array a (e.g. a memmap), try this:

u, inv = np.unique(np.ravel_multi_index((i, j), (1000, 4)), return_inverse=True)
a.flat[u] += np.bincount(inv)

I make this second method a little slower (2x) than "a", which isn't too surprising as the unique stage is going to be slow.
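For completeness, a self-contained sketch of this second method accumulating into a pre-existing array (the in-memory array here is just a stand-in for the memmap):

```python
import numpy as np

a = np.zeros((1000, 4), dtype='uint32')  # stand-in for a large memmap
rng = np.random.default_rng(2)
i = rng.integers(0, 1000, size=1000)
j = rng.integers(0, 4, size=1000)

u, inv = np.unique(np.ravel_multi_index((i, j), (1000, 4)),
                   return_inverse=True)
# u: the unique flat indices; np.bincount(inv): repeat count per unique index.
a.flat[u] += np.bincount(inv).astype(a.dtype)
print(a.sum())  # 1000
```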


8 Comments

I considered using bincount(). However, my array is actually much larger than in the example I provided (>300×10⁶ elements), and the number of (i, j) update pairs is relatively small (around 20×10⁶) with many repeats. I am afraid that the bincount approach will use too much memory.
@user1451766 - bincount won't use any more memory than you need to construct a = numpy.zeros(...) in your original post.
In my case, 'a' is a memmap() file. Neither the fancy-indexing nor the for-loop approach requires an intermediate memory buffer that large.
However, your ravel_multi_index in combination with unique( ,return_inverse=True) might do the trick!
I added another method, is that what you were after?
