Summing data from array based on other array in Numpy

Question

I have two 2D numpy arrays (simplified in this example with respect to size and content) with identical sizes.

An ID matrix:

1 1 1 2 2
1 1 2 2 5
1 1 2 5 5
1 2 2 5 5
2 2 5 5 5

and a value matrix:

14.8 17.0 74.3 40.3 90.2
25.2 75.9  5.6 40.0 33.7
78.9 39.3 11.3 63.6 56.7
11.4 75.7 78.4 88.7 58.6
79.6 32.3 35.3 52.5 13.3

My goal is to count and sum the values from the second matrix grouped by the IDs from the first matrix:

1: (8, 336.8)
2: (9, 453.4)
5: (8, 402.4)

I can do this in a for loop, but when the matrices have thousands of rows and columns instead of just 5x5, and thousands of unique IDs, it takes a long time to process.

Does numpy have a clever method or a combination of methods for doing this?

Divakar · Accepted Answer · 2016-04-15 09:48:35Z

6

Here's a vectorized approach that gets the count for each ID and the sum of the entries of value per ID, using a combination of np.unique and np.bincount -

unqID,idx,IDsums = np.unique(ID,return_counts=True,return_inverse=True)

value_sums = np.bincount(idx,value.ravel())

To get the final output as a dictionary, you can use a dictionary comprehension to gather the counts and summed values, like so -

{i:(IDsums[itr],value_sums[itr]) for itr,i in enumerate(unqID)}

Sample run -

In [86]: ID
Out[86]: 
array([[1, 1, 1, 2, 2],
       [1, 1, 2, 2, 5],
       [1, 1, 2, 5, 5],
       [1, 2, 2, 5, 5],
       [2, 2, 5, 5, 5]])

In [87]: value
Out[87]: 
array([[ 14.8,  17. ,  74.3,  40.3,  90.2],
       [ 25.2,  75.9,   5.6,  40. ,  33.7],
       [ 78.9,  39.3,  11.3,  63.6,  56.7],
       [ 11.4,  75.7,  78.4,  88.7,  58.6],
       [ 79.6,  32.3,  35.3,  52.5,  13.3]])

In [88]: unqID,idx,IDsums = np.unique(ID,return_counts=True,return_inverse=True)
    ...: value_sums = np.bincount(idx,value.ravel())
    ...: 

In [89]: {i:(IDsums[itr],value_sums[itr]) for itr,i in enumerate(unqID)}
Out[89]: 
{1: (8, 336.80000000000001),
 2: (9, 453.40000000000003),
 5: (8, 402.40000000000003)}
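The same idea can be wrapped in a small self-contained function. This is only a sketch: the name grouped_count_and_sum is hypothetical, and both arrays are flattened before np.unique is called, on the assumption that np.bincount needs a 1-D index array and that some NumPy versions may return the inverse index with the shape of the input rather than flattened.

```python
import numpy as np

def grouped_count_and_sum(ids, values):
    """Return (unique_ids, counts, sums) of `values` grouped by `ids`.

    Hypothetical helper, not part of NumPy. Both inputs are flattened so
    that the inverse index is 1-D, as np.bincount requires.
    """
    flat_ids = np.asarray(ids).ravel()
    flat_values = np.asarray(values).ravel()
    # inverse[k] is the position of flat_ids[k] within unique_ids
    unique_ids, inverse, counts = np.unique(
        flat_ids, return_inverse=True, return_counts=True
    )
    # bincount with weights adds up flat_values per group index
    sums = np.bincount(inverse, weights=flat_values, minlength=unique_ids.size)
    return unique_ids, counts, sums

ID = np.array([[1, 1, 1, 2, 2],
               [1, 1, 2, 2, 5],
               [1, 1, 2, 5, 5],
               [1, 2, 2, 5, 5],
               [2, 2, 5, 5, 5]])

value = np.array([[14.8, 17.0, 74.3, 40.3, 90.2],
                  [25.2, 75.9,  5.6, 40.0, 33.7],
                  [78.9, 39.3, 11.3, 63.6, 56.7],
                  [11.4, 75.7, 78.4, 88.7, 58.6],
                  [79.6, 32.3, 35.3, 52.5, 13.3]])

unique_ids, counts, sums = grouped_count_and_sum(ID, value)
result = {int(i): (int(c), float(s)) for i, c, s in zip(unique_ids, counts, sums)}
print(result)
```

The sums agree with the sample run above up to floating-point rounding.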

edited Apr 15, 2016 at 9:48

answered Apr 15, 2016 at 9:40

Divakar


2 Comments

MB-F Over a year ago

Nice one! I was not aware of the return_* arguments for np.unique.

Chau Over a year ago

@Divakar: Thank You! This was exactly the kind of solution I was looking for with a good performance due to the vectorisation.

MB-F · Answer · 2016-04-15 09:34:42Z

1

This is possible with a combination of a few simple methods:

use numpy.unique to find each ID
create a boolean mask for each ID
sum the 1s in the mask (count) and the values where the mask is 1

This can look like this:

import numpy as np

ids = np.array([[1, 1, 1, 2, 2],
                [1, 1, 2, 2, 5],
                [1, 1, 2, 5, 5],
                [1, 2, 2, 5, 5],
                [2, 2, 5, 5, 5]])

values = np.array([[14.8, 17.0, 74.3, 40.3, 90.2],
                   [25.2, 75.9,  5.6, 40.0, 33.7],
                   [78.9, 39.3, 11.3, 63.6, 56.7],
                   [11.4, 75.7, 78.4, 88.7, 58.6],
                   [79.6, 32.3, 35.3, 52.5, 13.3]])


for i in np.unique(ids):  # loop through all IDs
    mask = ids == i  # find entries that match current ID
    count = np.sum(mask)  # number of matches
    total = np.sum(values[mask])  # sum of the matching values
    print('{}: ({}, {:.1f})'.format(i, count, total))  # print result

# Output:
# 1: (8, 336.8)
# 2: (9, 453.4)
# 5: (8, 402.4)

answered Apr 15, 2016 at 9:34

MB-F


3 Comments

Chau Over a year ago

It's exactly that nasty for loop I'm referring to in my question; I should have been clearer about that, though.

MB-F Over a year ago

I don't think there is really a succinct way of doing that without the for loop. It may be possible, but it would likely lead to very unreadable code. If you only have a few unique IDs, the for loop should not cause too big a performance hit. Anyway, I will think about it for a while...

MB-F Over a year ago

Looks like I was just proven wrong in Divakar's answer.

Eelco Hoogendoorn · Answer · 2016-04-15 10:54:16Z

0

The numpy_indexed package (disclaimer: I am its author) has functionality to solve this kind of problem in an elegant and vectorized manner:

import numpy_indexed as npi
group_by = npi.group_by(ID.flatten())
ID_unique, value_sums = group_by.sum(value.flatten())
ID_count = group_by.count

Note: if you want to compute the sum and count in order to compute a mean, there is also group_by.mean; plus a lot of other useful functionality.
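For comparison, a per-ID mean can also be derived in plain NumPy by dividing grouped sums by grouped counts. The sketch below does not use numpy_indexed; it assumes the ID and value arrays from the question, and the variable names are illustrative only.

```python
import numpy as np

ID = np.array([[1, 1, 1, 2, 2],
               [1, 1, 2, 2, 5],
               [1, 1, 2, 5, 5],
               [1, 2, 2, 5, 5],
               [2, 2, 5, 5, 5]])

value = np.array([[14.8, 17.0, 74.3, 40.3, 90.2],
                  [25.2, 75.9,  5.6, 40.0, 33.7],
                  [78.9, 39.3, 11.3, 63.6, 56.7],
                  [11.4, 75.7, 78.4, 88.7, 58.6],
                  [79.6, 32.3, 35.3, 52.5, 13.3]])

# Flatten first so the inverse index is 1-D for np.bincount
unique_ids, inverse = np.unique(ID.ravel(), return_inverse=True)
counts = np.bincount(inverse)                       # entries per ID
sums = np.bincount(inverse, weights=value.ravel())  # summed values per ID
means = sums / counts  # every unique ID occurs at least once, so no zero division

group_means = dict(zip(unique_ids.tolist(), means.tolist()))
print(group_means)
```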

edited Apr 15, 2016 at 10:54

answered Apr 15, 2016 at 10:47

Eelco Hoogendoorn

