How to get a sorted cumulative array of values in numpy?

Question

I have the following numpy arrays (which are actually pandas columns) that represent observations (a position and a value):

df['x'] = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
df['y'] = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])

Instead, I would like to get the following two arrays:

[1 2 3 4 5]
[4 3 2 3 2]

This amounts to grouping all items with the same value in df['x'] and summing the corresponding values in df['y'] (in other words, getting the cumulative sum of values for each individual position), with the positions returned in sorted order.

What is the most straightforward way to achieve that in numpy?

Since they are in a dataframe, I think you can just do df.groupby('x', as_index=False)['y'].sum()
– tdy, Commented Feb 20, 2022 at 22:51
Is there a reason why you don't want to use pandas for this?
– Michael Butscher, Commented Feb 20, 2022 at 22:51
You can just export the result to numpy, or do you specifically want to do all of this in numpy?
– Akmal Soliev, Commented Feb 20, 2022 at 22:52
I am curious to understand, for learning purposes, how this could be done purely in numpy, if there is such an option.
– M.E., Commented Feb 20, 2022 at 22:55
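
For reference, the pandas route suggested in the comments could look roughly like the sketch below. The DataFrame is rebuilt from the question's data, and the names `grouped`, `positions` and `totals` are only illustrative; it assumes the default behaviour of `groupby`, which sorts the group keys.

```python
import numpy as np
import pandas as pd

# Rebuild the question's data as a DataFrame (illustrative setup).
df = pd.DataFrame({
    "x": np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5]),
    "y": np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2]),
})

# Sum y within each group of equal x; group keys come back sorted by default.
grouped = df.groupby("x", as_index=False)["y"].sum()

# Export both columns back to plain NumPy arrays.
positions = grouped["x"].to_numpy()
totals = grouped["y"].to_numpy()

print(positions)  # [1 2 3 4 5]
print(totals)     # [4 3 2 3 2]
```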

Andras Deak -- Слава Україні · Accepted Answer · 2022-02-20 23:02:34Z

2

As others have noted in comments, if you're already using pandas it's probably a good idea to use a sum over groupby. That being said, if you insist on using raw NumPy you can map each element of x to the index of its unique value and then sum up the corresponding values of y in an accumulator array:

import numpy as np

x = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
y = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])

vals, inds = np.unique(x, return_inverse=True)
res = np.zeros_like(vals, dtype=y.dtype)
np.add.at(res, inds, y)

print(res)
# [4 3 2 3 2]

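vals holds the unique values in x, sorted; this is the first of the two requested arrays, and in the code above it only serves to size the result. inds is the key: it holds, for each element of x, the index of its value in vals. These are the positions in the result where we want to accumulate the corresponding values from y. The last trick is using np.add.at for an unbuffered summation, so that every occurrence of a repeated index is added.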
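
To make the role of the inverse indices and of the unbuffered summation more concrete, here is a small sketch on the same data. The names `buffered` and `unbuffered` are introduced only for this illustration; the point is that plain fancy-indexed `+=` does not accumulate over repeated indices, while `np.add.at` does.

```python
import numpy as np

x = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
y = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])

vals, inds = np.unique(x, return_inverse=True)
print(vals)  # [1 2 3 4 5]
print(inds)  # [0 1 2 1 0 0 1 2 3 4]

# Indexing vals with the inverse indices reconstructs the original x.
print(np.array_equal(vals[inds], x))  # True

# Buffered: each repeated index is written only once, so contributions are lost.
buffered = np.zeros_like(vals, dtype=y.dtype)
buffered[inds] += y

# Unbuffered: every occurrence of a repeated index is accumulated.
unbuffered = np.zeros_like(vals, dtype=y.dtype)
np.add.at(unbuffered, inds, y)

print(unbuffered)                            # [4 3 2 3 2]
print(np.array_equal(buffered, unbuffered))  # False
```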

The result is stored in res.
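
As a possible alternative to np.add.at (not part of the original answer), np.bincount with its weights argument can perform the same grouped sum once the inverse indices are known. This is only a sketch: bincount returns floats when weights are given, so the result is cast back to the dtype of y, which assumes the sums are exactly representable.

```python
import numpy as np

x = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
y = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])

vals, inds = np.unique(x, return_inverse=True)

# bincount sums the weights that fall into each bin; the bins are the inverse indices.
# minlength guards against a shorter result, and the output is float64 with weights.
res_bincount = np.bincount(inds, weights=y, minlength=len(vals)).astype(y.dtype)

print(vals)          # [1 2 3 4 5]
print(res_bincount)  # [4 3 2 3 2]
```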

edited Feb 20, 2022 at 23:02

answered Feb 20, 2022 at 22:56


BENY · Accepted Answer · 2022-02-21 01:36:50Z

1

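We can try:

import numpy as np

def groupby(a, b):
    # Stable sort of the keys b so that equal keys become contiguous
    sidx = b.argsort(kind='mergesort')
    a_sorted = a[sidx]
    b_sorted = b[sidx]
    # Boundaries of each run of equal keys, including both ends
    cut_idx = np.flatnonzero(np.r_[True,b_sorted[1:] != b_sorted[:-1],True])
    # Sum the values of a within each run
    out = [sum(a_sorted[i:j]) for i,j in zip(cut_idx[:-1],cut_idx[1:])]
    return out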


groupby(df['y'].values,df['x'].values)
Out[223]: [4, 3, 2, 3, 2]

Note that this is adapted from a function in Divakar's answer, which you can refer to for the original (thanks again, Divakar, for teaching me numpy :-)).
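
The same sort-and-cut idea can probably be kept fully vectorized by replacing the Python-level loop with np.add.reduceat. The sketch below is a variation, not the function from the answer; `groupby_sum` is a hypothetical name, and it assumes non-empty 1-D inputs of equal length.

```python
import numpy as np

def groupby_sum(values, keys):
    """Sum `values` per distinct key; assumes non-empty 1-D arrays of equal length."""
    # Stable sort so that equal keys become contiguous runs.
    order = np.argsort(keys, kind="stable")
    keys_sorted = keys[order]
    values_sorted = values[order]
    # Start index of each run of equal keys.
    starts = np.flatnonzero(np.r_[True, keys_sorted[1:] != keys_sorted[:-1]])
    # reduceat sums each slice between consecutive start indices.
    return keys_sorted[starts], np.add.reduceat(values_sorted, starts)

x = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
y = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])

unique_keys, sums = groupby_sum(y, x)
print(unique_keys)  # [1 2 3 4 5]
print(sums)         # [4 3 2 3 2]
```

Returning the keys alongside the sums gives both of the arrays asked for in the question in one call.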

answered Feb 21, 2022 at 1:36
