Here is what my data looks like:

a = np.array([[1,2],[2,1],[7,1],[3,2]])

I want a sum for each distinct value in the second column. In the example, there are two possible values in the second column: 1 and 2.

I want to sum all values in the first column that share the same value in the second column. Is there a built-in numpy function for this?

For example, the sum for the rows with 1 in the second column would be: 2 + 7 = 9

5 Answers

2

A short but slightly dodgy way is to use numpy's bincount function:

np.bincount(a[:,1], weights=a[:,0])

By default, bincount counts the number of occurrences of 0, 1, 2, etc. in the array (in this case a[:,1], which holds your category numbers). With weights, each occurrence contributes its weight instead of 1 (here, the first value of the same row), so each count becomes a sum.

What it returns is this:

array([ 0.,  9.,  4.])

where the first entry (0.) is the sum of first elements whose second element is 0, the next is the sum for 1, and so on. So it only works if the values you group by are non-negative integers.
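To make the role of weights concrete, here is a minimal sketch that compares bincount against an explicit accumulation loop. The names keys, values, manual and fast are only illustrative and not part of the answer above.

```python
import numpy as np

a = np.array([[1, 2], [2, 1], [7, 1], [3, 2]])
keys = a[:, 1]     # group labels (second column)
values = a[:, 0]   # values to sum (first column)

# Explicit equivalent of np.bincount(keys, weights=values):
# slot k accumulates every value whose label equals k.
manual = np.zeros(keys.max() + 1)
for k, v in zip(keys, values):
    manual[k] += v

fast = np.bincount(keys, weights=values)
print(manual.tolist())  # [0.0, 9.0, 4.0]
print(fast.tolist())    # [0.0, 9.0, 4.0]
```

Slot 0 stays at zero because no row has 0 in the second column.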

If they are not consecutive integers starting from 0, you can select those you need by doing:

np.bincount(a[:,1], weights=a[:,0])[np.unique(a[:,1])]

This will return

array([9.,  4.])

which is an array of sums, sorted by the second element (because unique returns a sorted list).


If your second elements are not integers, you first have a potential problem with floating-point arithmetic (elements that you think are equal may actually differ). However, if you are sure that is fine, you can sort them and assign integers to them (using scipy's rankdata function, for example):

from scipy.stats import rankdata as rd

ind = rd(a[:,1], method='dense').astype(int) - 1  # ranking begins from 1, we need from 0
sums = np.bincount(ind, weights=a[:,0])

This will return array([9., 4.]), in order sorted by your second element. You can zip them to pair sums with appropriate elements:

list(zip(np.unique(a[:,1]), sums))
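If you would rather avoid the scipy dependency, one possible alternative (not part of the answer above) is np.unique with return_inverse=True, which returns both the sorted distinct labels and a 0-based integer code for each row. The float labels below are made up for illustration, and the grouping still relies on exact equality, so the floating-point caveat applies unchanged.

```python
import numpy as np

# Made-up example with non-integer labels in the second column
a = np.array([[1, 2.5], [2, 0.5], [7, 0.5], [3, 2.5]])

# labels: sorted distinct values; inverse: index of each row's label in labels
labels, inverse = np.unique(a[:, 1], return_inverse=True)
sums = np.bincount(inverse, weights=a[:, 0])

print(labels.tolist())  # [0.5, 2.5]
print(sums.tolist())    # [9.0, 4.0]
```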

3 Comments

It is a good function to know. Thanks. However, my columns are floats not integers.
@AbhinavKumar This is quite a bad idea, as I link above. However, see edited answer for making it work with floats.
@AbhinavKumar Are the column 2 values integers (even if they are stored as floats)? If not, equality comparison will be tricky.
2

Contents of play.py

import numpy as np

def compute_sum1(a):
    unique = np.unique(a[:, 1])
    same_idxs = ((u, np.argwhere(a[:, 1] == u)) for u in unique)
    # First coordinate of tuple contains value of col 2
    # Second coordinate contains the sum of entries from col 1
    same_sum = [(u, np.sum(a[idx, 0])) for u, idx in same_idxs]
    return same_sum

def compute_sum2(a):
    """A minimal implementation of compute_sum"""
    unique = np.unique(a[:, 1])
    same_idxs = (np.argwhere(a[:, 1] == u) for u in unique)
    same_sum = (np.sum(a[idx, 0]) for idx in same_idxs)
    return same_sum

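def compute_sum3(a):
    # Note: stacking the index arrays like this only works when
    # every group has the same number of rows.
    unique = np.unique(a[:, 1])
    same_idxs = [np.argwhere(a[:, 1] == u) for u in unique]
    same_sum = np.sum(a[same_idxs, 0].squeeze(), 1)
    return same_sum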

def main():
    a = np.array([[1,2],[2,1],[7,1],[3,2]]).astype("float")
    print("compute_sum1")
    print(compute_sum1(a))
    print("compute_sum3")
    print(compute_sum3(a))
    print("compute_sum2")
    same_sum = [s for s in compute_sum2(a)]
    print(same_sum)


if __name__ == '__main__':
    main()

Output:

In [59]: play.main()
compute_sum1
[(1.0, 9.0), (2.0, 4.0)]
compute_sum3
[ 9.  4.]
compute_sum2
[9.0, 4.0]

In [60]: %timeit play.compute_sum1(a)
10000 loops, best of 3: 95 µs per loop

In [61]: %timeit play.compute_sum2(a)
100000 loops, best of 3: 14.1 µs per loop

In [62]: %timeit play.compute_sum3(a)
10000 loops, best of 3: 77.4 µs per loop

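Note that compute_sum2() only appears to be the fastest: it returns a generator, so %timeit mostly measures np.unique and the creation of the generator, not the sums themselves, which are computed only when the result is iterated. If your matrix is huge, this implementation can still be worth using, because a generator expression avoids building intermediate lists and is therefore more memory efficient than a list comprehension. Similarly, same_sum in compute_sum1() can be converted to a generator expression by replacing [] with ().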
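For groups of unequal size, a loop-free variant is to sort by the second column and sum each run of equal labels with np.add.reduceat. This is only a sketch of that idea, not one of the functions timed above; group_sum_sorted is a hypothetical helper name, and it assumes a non-empty 2-column array.

```python
import numpy as np

def group_sum_sorted(a):
    """Hypothetical helper: sum column 0 per distinct value of column 1."""
    order = np.argsort(a[:, 1], kind="stable")
    keys = a[order, 1]
    vals = a[order, 0]
    # Positions where each run of equal keys starts in the sorted array
    starts = np.flatnonzero(np.r_[True, keys[1:] != keys[:-1]])
    return keys[starts], np.add.reduceat(vals, starts)

# Same data as in main(), plus one extra row so the groups differ in size
a = np.array([[1, 2], [2, 1], [7, 1], [3, 2], [5, 1]], dtype=float)
labels, sums = group_sum_sorted(a)
print(labels.tolist())  # [1.0, 2.0]
print(sums.tolist())    # [14.0, 4.0]
```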


2

You might want to have a look at this library: https://github.com/ml31415/accumarray . It's a clone of MATLAB's accumarray for numpy.

a = np.array([[1,2],[2,1],[7,1],[3,2]])
accum(a[:,1], a[:,0])
>>> array([0, 9, 4])

The first 0 means that there were no rows with 0 in the index column.
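If installing an extra library is not an option, a similar accumulation can be sketched in plain numpy with np.add.at. This is a swapped-in technique rather than the library's own implementation; idx, vals and out are illustrative names.

```python
import numpy as np

a = np.array([[1, 2], [2, 1], [7, 1], [3, 2]])
idx, vals = a[:, 1], a[:, 0]

# One output slot per possible index value, from 0 up to idx.max()
out = np.zeros(idx.max() + 1, dtype=vals.dtype)
# np.add.at is unbuffered, so repeated indices accumulate
# (a plain out[idx] += vals would apply only one update per index)
np.add.at(out, idx, vals)
print(out.tolist())  # [0, 9, 4]
```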


1

The most straightforward way I see is through a list comprehension:

s = [[sum(x[0] for x in a if x[1] == y), y] for y in set([q[1] for q in a])]

However, if the second number in your lists represents some kind of a category, I suggest you convert your data into a dictionary.
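A minimal sketch of the dictionary idea, assuming the rows are available as plain Python lists (for a numpy array you could call a.tolist() first). The names rows and totals are illustrative.

```python
from collections import defaultdict

rows = [[1, 2], [2, 1], [7, 1], [3, 2]]  # same data as plain lists

# Map each category (second element) to the running sum of first elements
totals = defaultdict(int)
for value, category in rows:
    totals[category] += value

print(dict(totals))  # {2: 4, 1: 9}
```

This makes a single pass over the data, whereas the list comprehension above rescans all rows once per category.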

3 Comments

The data I show in my question is a simplified version of what I have. I cannot manually do what you suggest a hundred times.
Thanks for your reply. I am curious: is there any other alternative I should look into apart from list comprehension?
@AbhinavKumar Maybe some other answers could show up in time, I reckon there might be some numpy functions which do it in a couple of steps...
1

As far as I know, there is no function to do this in numpy, but this can easily be done with pandas.DataFrame.groupby.

In [7]: import pandas as pd
In [8]: import numpy as np
In [9]: a = np.array([[1,2],[2,1],[7,1],[3,2]])
In [10]: df = pd.DataFrame(a)
In [11]: df.groupby(1)[0].sum()
Out[11]: 
1
1    9
2    4
Name: 0, dtype: int64

Of course, you could do the same thing with itertools.groupby:

In [1]: import numpy as np
   ...: from itertools import groupby
   ...: from operator import itemgetter
   ...: 

In [3]: a = np.array([[1,2],[2,1],[7,1],[3,2]])

In [4]: sa = sorted(a.tolist(), key=itemgetter(1))

In [5]: grouper = groupby(sa, key=itemgetter(1))

In [6]: sums = {idx : sum(row[0] for row in group) for idx, group in grouper}

In [7]: sums
Out[7]: {1: 9, 2: 4}
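The sorting step matters because itertools.groupby only groups consecutive items with equal keys. A small sketch with the same data as plain lists shows the difference; the variable names are illustrative.

```python
from itertools import groupby
from operator import itemgetter

rows = [[1, 2], [2, 1], [7, 1], [3, 2]]

# Without sorting, the two rows with key 2 are not adjacent,
# so they end up in separate groups.
unsorted_groups = [(k, sum(r[0] for r in g))
                   for k, g in groupby(rows, key=itemgetter(1))]
print(unsorted_groups)  # [(2, 1), (1, 9), (2, 3)]

# After sorting by the key, each key forms exactly one group.
rows_sorted = sorted(rows, key=itemgetter(1))
grouped = [(k, sum(r[0] for r in g))
           for k, g in groupby(rows_sorted, key=itemgetter(1))]
print(grouped)  # [(1, 9), (2, 4)]
```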

