Manipulating data from python Numpy array: Using values from one column to sum over adjacent value

Question

Here is what my data looks like:

a = np.array([[1,2],[2,1],[7,1],[3,2]])

I want to compute a sum for each distinct value in the second column. In the example, the second column has two possible values: 1 and 2.

I want to sum all values in the first column that share the same value in the second column. Is there a built-in numpy function for this?

For example, the sum for the value 1 in the second column would be: 2 + 7 = 9

Community · Accepted Answer · 2017-05-23 12:32:49Z

2

A short but slightly hacky way is to use the numpy function bincount:

np.bincount(a[:,1], weights=a[:,0])

bincount counts the number of occurrences of 0, 1, 2, etc. in the array (in this case a[:,1], the column of your category numbers). With weights, each occurrence contributes its weight instead of 1. Here the weights are the first-column values, so the result is the sum of first-column values per category.

It returns:

array([ 0.,  9.,  4.])

where 0 is the sum of the first elements whose second element is 0, 9 is the sum for those whose second element is 1, and so on. So it will only work if the values you group by are non-negative integers.
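To make the weighting concrete, here is a small sketch that compares bincount with an explicit loop doing the same accumulation. The names keys, values, manual and fast are illustrative only and do not come from the answer.

```python
import numpy as np

a = np.array([[1, 2], [2, 1], [7, 1], [3, 2]])
keys = a[:, 1]      # group labels (must be non-negative integers)
values = a[:, 0]    # numbers to be summed per label

# Explicit loop: add each value into the slot given by its label.
manual = np.zeros(keys.max() + 1)
for k, v in zip(keys, values):
    manual[k] += v

# Same accumulation in a single call.
fast = np.bincount(keys, weights=values)

print(manual)  # [0. 9. 4.]
print(fast)    # [0. 9. 4.]
```

Slot 0 stays at zero because no row carries the label 0.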

If they are not consecutive integers starting from 0, you can select those you need by doing:

np.bincount(a[:,1], weights=a[:,0])[np.unique(a[:,1])]

This will return

array([9.,  4.])

which is an array of sums, sorted by the second element (because unique returns a sorted list).

If your second elements are not integers, first off you are in some kind of trouble because of floating point arithmetic (elements which you think are equal could be different in reality). However, if you are sure it is fine, you can sort them and assign integers to them (using scipy.stats.rankdata, for example):

from scipy.stats import rankdata as rd
ind = rd(a[:,1], method='dense').astype(int) - 1  # dense ranks start at 1, bincount needs labels from 0
sums = np.bincount(ind, weights=a[:,0])

This will return array([9., 4.]), ordered by your second element. You can zip the two to pair each sum with its element (in Python 3, wrap the call in list() to see the pairs):

list(zip(np.unique(a[:,1]), sums))
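As an alternative sketch that stays within NumPy, np.unique can produce the integer labels itself through its return_inverse option, which avoids the separate ranking step. This is a swapped-in technique rather than the code from the answer, and the same floating point caveat about equality applies.

```python
import numpy as np

a = np.array([[1, 2], [2, 1], [7, 1], [3, 2]], dtype=float)

# uniq holds the sorted distinct labels; inverse maps every row to the
# position of its label in uniq, i.e. integer codes 0..len(uniq)-1.
uniq, inverse = np.unique(a[:, 1], return_inverse=True)
sums = np.bincount(inverse, weights=a[:, 0])

pairs = list(zip(uniq.tolist(), sums.tolist()))
print(pairs)  # [(1.0, 9.0), (2.0, 4.0)]
```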

edited May 23, 2017 at 12:32

CommunityBot


answered Feb 3, 2014 at 1:03

sashkello


3 Comments

Abhinav Kumar Over a year ago

It is a good function to know. Thanks. However, my columns are floats not integers.

sashkello Over a year ago

@AbhinavKumar This is quite a bad idea, as I link above. However, see edited answer for making it work with floats.

lightalchemist Over a year ago

@AbhinavKumar Are column 2 values integers (even if they are stored as floats)? If not equality comparison will be tricky.

lightalchemist · Accepted Answer · 2014-02-03 03:10:16Z

Contents of play.py

import numpy as np

def compute_sum1(a):
    unique = np.unique(a[:, 1])
    same_idxs = ((u, np.argwhere(a[:, 1] == u)) for u in unique)
    # First coordinate of tuple contains value of col 2
    # Second coordinate contains the sum of entries from col 1
    same_sum = [(u, np.sum(a[idx, 0])) for u, idx in same_idxs]
    return same_sum

def compute_sum2(a):
    """A minimal implementation of compute_sum"""
    unique = np.unique(a[:, 1])
    same_idxs = (np.argwhere(a[:, 1] == u) for u in unique)
    same_sum = (np.sum(a[idx, 0]) for idx in same_idxs)
    return same_sum

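def compute_sum3(a):
    # Note: stacking the index arrays only works when every group
    # has the same number of rows (true for the example data).
    unique = np.unique(a[:, 1])
    same_idxs = [np.argwhere(a[:, 1] == u) for u in unique]
    same_sum = np.sum(a[same_idxs, 0].squeeze(), 1)
    return same_sum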

def main():
    a = np.array([[1,2],[2,1],[7,1],[3,2]]).astype("float")
    print("compute_sum1")
    print(compute_sum1(a))
    print("compute_sum3")
    print(compute_sum3(a))
    print("compute_sum2")
    same_sum = [s for s in compute_sum2(a)]
    print(same_sum)


if __name__ == '__main__':
    main()

Output:

In [59]: play.main()
compute_sum1
[(1.0, 9.0), (2.0, 4.0)]
compute_sum3
[ 9.  4.]
compute_sum2
[9.0, 4.0]

In [60]: %timeit play.compute_sum1(a)
10000 loops, best of 3: 95 µs per loop

In [61]: %timeit play.compute_sum2(a)
100000 loops, best of 3: 14.1 µs per loop

In [62]: %timeit play.compute_sum3(a)
10000 loops, best of 3: 77.4 µs per loop

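Note that compute_sum2() reports the lowest time, but it returns a generator, so its %timeit figure covers only np.unique and building the generator; the sums themselves are computed when the generator is consumed, as main() does with its list comprehension. If your matrix is huge, I still suggest this implementation, because a generator expression avoids holding every intermediate index array in memory at once, which a list comprehension does not. Similarly, same_sum in compute_sum1() can be converted to a generator expression by replacing [] with ().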
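compute_sum3() stacks the per-group index arrays into one array, which requires all groups to be the same size. A minimal sketch of a vectorized variant that tolerates unequal group sizes is shown below; it sorts by the second column and sums each run with np.add.reduceat. The helper name grouped_sums and the extra row [5, 1] are hypothetical additions for illustration, and non-empty input is assumed.

```python
import numpy as np

def grouped_sums(a):
    """Hypothetical helper: sum column 0 per distinct value of column 1."""
    order = np.argsort(a[:, 1], kind="stable")
    keys = a[order, 1]
    vals = a[order, 0]
    # Index where each run of equal keys begins in the sorted data.
    starts = np.flatnonzero(np.r_[True, keys[1:] != keys[:-1]])
    # reduceat sums vals[starts[i]:starts[i+1]] for each run.
    return keys[starts], np.add.reduceat(vals, starts)

# Example data plus one extra row so the groups have different sizes.
a = np.array([[1, 2], [2, 1], [7, 1], [3, 2], [5, 1]], dtype=float)
keys, sums = grouped_sums(a)
print(keys)  # [1. 2.]
print(sums)  # [14.  4.]
```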

Michael · Accepted Answer · 2014-02-03 03:17:00Z

2

You might want to have a look at this library: https://github.com/ml31415/accumarray . It's a clone of MATLAB's accumarray for numpy.

a = np.array([[1,2],[2,1],[7,1],[3,2]])
accum(a[:,1], a[:,0])
>>> array([0, 9, 4])

The leading 0 means that no rows have 0 in the index column.
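For readers who prefer not to add a dependency, the same accumulation can be sketched in plain NumPy with np.add.at. This is not the library's implementation, just an equivalent for this example; the names idx, vals and out are illustrative.

```python
import numpy as np

a = np.array([[1, 2], [2, 1], [7, 1], [3, 2]])
idx, vals = a[:, 1], a[:, 0]

# One output slot per possible index value, 0..max(idx).
out = np.zeros(idx.max() + 1, dtype=vals.dtype)
np.add.at(out, idx, vals)   # unbuffered add: repeated indices accumulate
print(out)  # [0 9 4]
```

A plain out[idx] += vals would not work here, because repeated indices would be applied only once.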

answered Feb 3, 2014 at 3:17

Michael


Comments

sashkello · Accepted Answer · 2014-02-03 00:40:57Z

1

The most straightforward way I see is through a list comprehension:

s = [[sum(x[0] for x in a if x[1] == y), y] for y in set([q[1] for q in a])]

However, if the second number in your lists represents some kind of a category, I suggest you convert your data into a dictionary.
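One way to read the dictionary suggestion is sketched below, with the category as the key and the running total as the value. The name totals is illustrative, and this is only one possible layout for such a dictionary.

```python
from collections import defaultdict

import numpy as np

a = np.array([[1, 2], [2, 1], [7, 1], [3, 2]])

totals = defaultdict(int)
for value, category in a.tolist():   # tolist() yields plain Python ints
    totals[category] += value

print(dict(totals))  # {2: 4, 1: 9}
```

The keys appear in the order the categories are first seen, not in sorted order.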

answered Feb 3, 2014 at 0:40

sashkello


3 Comments

Abhinav Kumar Over a year ago

The data I show in my question is simplified version of what I have. I cannot manually do what you suggest hundred times.

Abhinav Kumar Over a year ago

Thanks for your reply. I am curious, about any other alternative I should look into apart from list comprehension?

sashkello Over a year ago

@AbhinavKumar Maybe some other answers could show up in time, I reckon there might be some numpy functions which do it in a couple of steps...

JaminSore · Accepted Answer · 2014-02-03 15:18:36Z

1

As far as I know, there is no function to do this in numpy, but this can easily be done with pandas.DataFrame.groupby.

In [7]: import pandas as pd
In [8]: import numpy as np
In [9]: a = np.array([[1,2],[2,1],[7,1],[3,2]])
In [10]: df = pd.DataFrame(a)
In [11]: df.groupby(1)[0].sum()
Out[11]: 
1
1    9
2    4
Name: 0, dtype: int64

Of course, you could do the same thing with itertools.groupby, as long as the rows are sorted by the grouping column first:

In [1]: import numpy as np
   ...: from itertools import groupby
   ...: from operator import itemgetter
   ...: 

In [3]: a = np.array([[1,2],[2,1],[7,1],[3,2]])

In [4]: sa = sorted(a.tolist(), key=itemgetter(1))

In [5]: grouper = groupby(sa, key=itemgetter(1))

In [6]: sums = {idx : sum(row[0] for row in group) for idx, group in grouper}

In [7]: sums
Out[7]: {1: 9, 2: 4}

edited Feb 3, 2014 at 15:18

answered Feb 3, 2014 at 7:13

JaminSore

