I have the following numpy arrays (which are actually pandas columns) that represent observations (a position and a value):

df['x'] = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
df['y'] = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])

And instead, I would like to get the following two arrays:

[1 2 3 4 5]
[4 3 2 3 2]

This is basically grouping all items with the same value in df['x'] and summing the corresponding values in df['y'] (in other words, getting the total of the values observed at each individual position).

What is the most straightforward way to achieve that in numpy?

  • Since they are in a dataframe, I think you can just do df.groupby('x', as_index=False)['y'].sum(). Commented Feb 20, 2022 at 22:51
  • Is there a reason why you don't want to use pandas for this? Commented Feb 20, 2022 at 22:51
  • Thanks. Is there any way to do it purely in numpy? Commented Feb 20, 2022 at 22:51
  • You can just export the result to numpy, or do you specifically want to do all of this in numpy? Commented Feb 20, 2022 at 22:52
  • I am curious to understand, for learning purposes, how this could be done purely in numpy, if there is such an option. Commented Feb 20, 2022 at 22:55
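
For reference, the pandas route suggested in the comments might look roughly like the sketch below. The DataFrame is rebuilt from the question's data, and the names grouped, positions and totals are illustrative, not from the original post.

```python
import numpy as np
import pandas as pd

# Rebuild the question's data as a DataFrame (assumed layout: two columns, x and y).
df = pd.DataFrame({
    "x": np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5]),
    "y": np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2]),
})

# Group rows by position and total the values in each group.
grouped = df.groupby("x", as_index=False)["y"].sum()

# Export both columns back to plain NumPy arrays.
positions = grouped["x"].to_numpy()
totals = grouped["y"].to_numpy()

print(positions)  # [1 2 3 4 5]
print(totals)     # [4 3 2 3 2]
```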

2 Answers

2

As others have noted in comments, if you're already using pandas it's probably a good idea to use a sum over groupby. That being said, if you insist on using raw NumPy you can find the unique indices of x and then sum up corresponding values in y in an accumulator array:

import numpy as np

x = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
y = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])

vals, inds = np.unique(x, return_inverse=True)
res = np.zeros_like(vals, dtype=y.dtype)
np.add.at(res, inds, y)

print(res)
# [4 3 2 3 2]

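vals holds the sorted unique values in x, which is the first array you asked for ([1 2 3 4 5]); in the summation it is only used to size res. inds is the key: for each element of x, it gives the index of that value in vals. These are the positions in the result where we want to accumulate the corresponding values from y. The last trick is using np.add.at for an unbuffered summation, so that repeated indices are all added rather than only the last one taking effect.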

The result is stored in res.
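
As a possible alternative to np.add.at, the same inverse indices can be passed to np.bincount with y as weights. This is only a sketch: np.bincount returns floats when weights are given, so the result is cast back to y's dtype, which assumes the sums are exactly representable. The name sums is illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
y = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])

# inds maps each element of x to its slot in the sorted unique values.
vals, inds = np.unique(x, return_inverse=True)

# bincount adds up the weights that fall into each slot; with weights it
# returns float64, so cast back to the dtype of y.
sums = np.bincount(inds, weights=y).astype(y.dtype)

print(vals)  # [1 2 3 4 5]
print(sums)  # [4 3 2 3 2]
```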

1

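We can try a sort-based approach:

import numpy as np

def groupby(a, b):
    # Sort both arrays by the keys in b (mergesort is stable, so ties keep their order)
    sidx = b.argsort(kind='mergesort')
    a_sorted = a[sidx]
    b_sorted = b[sidx]
    # Boundaries of each run of equal keys, including the start and the end
    cut_idx = np.flatnonzero(np.r_[True, b_sorted[1:] != b_sorted[:-1], True])
    # Sum the values of a inside each run
    out = [sum(a_sorted[i:j]) for i, j in zip(cut_idx[:-1], cut_idx[1:])]
    return out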


groupby(df['y'].values,df['x'].values)
Out[223]: [4, 3, 2, 3, 2]

Note that the original function comes from Divakar's answer, which you can refer to (thanks again, Divakar, for teaching me numpy :-)).
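
The Python-level loop in the function above could probably be replaced by np.add.reduceat, which sums each run of equal keys in one call. The sketch below assumes non-empty 1-D inputs of equal length; group_sums and its variable names are hypothetical and not part of the original answer.

```python
import numpy as np

def group_sums(values, keys):
    """Return (unique_keys, per_key_sums); assumes non-empty 1-D arrays."""
    # Stable sort so equal keys end up in contiguous runs.
    order = np.argsort(keys, kind="stable")
    keys_sorted = keys[order]
    values_sorted = values[order]
    # Start index of each run of equal keys.
    starts = np.flatnonzero(np.r_[True, keys_sorted[1:] != keys_sorted[:-1]])
    # reduceat sums from each start up to the next start (or the end).
    return keys_sorted[starts], np.add.reduceat(values_sorted, starts)

x = np.array([1, 2, 3, 2, 1, 1, 2, 3, 4, 5])
y = np.array([2, 1, 1, 1, 1, 1, 1, 1, 3, 2])

unique_keys, totals = group_sums(y, x)
print(unique_keys)  # [1 2 3 4 5]
print(totals)       # [4 3 2 3 2]
```

This version also returns the unique positions, so both requested arrays come out of a single call.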
