3

I have a Numpy structured array that is sorted by the first column:

x = array([(2, 3), (2, 8), (4, 1)], dtype=[('recod', '<u8'), ('count', '<u4')])

I need to merge records (sum the values of the second column) where

x[n][0] == x[n + 1][0]

In this case, the desired output would be:

x = array([(2, 11), (4, 1)], dtype=[('recod', '<u8'), ('count', '<u4')])

What's the best way to achieve this?

1
  • Please edit this question to reflect the structured array you gave in a comment: array([(2, 3), (2, 8), (4, 1)], dtype=[('recod', '<u8'), ('count', '<u4')]). Your existing question looks more like a 2d array. Commented Aug 14, 2015 at 16:10

4 Answers

3

You can use np.unique to get an ID array for each element in the first column and then use np.bincount to perform accumulation on the second column elements based on the IDs -

In [140]: A
Out[140]: 
array([[25,  1],
       [37,  3],
       [37,  2],
       [47,  1],
       [59,  2]])

In [141]: unqA,idx = np.unique(A[:,0],return_inverse=True)

In [142]: np.column_stack((unqA,np.bincount(idx,A[:,1])))
Out[142]: 
array([[ 25.,   1.],
       [ 37.,   5.],
       [ 47.,   1.],
       [ 59.,   2.]])

You can avoid np.unique with a combination of np.diff and np.cumsum. This may help because np.unique sorts internally, which is unnecessary here since the input data is already sorted. The implementation would look something like this -

In [201]: A
Out[201]: 
array([[25,  1],
       [37,  3],
       [37,  2],
       [47,  1],
       [59,  2]])

In [202]: unq1 = np.append(True,np.diff(A[:,0])!=0)

In [203]: np.column_stack((A[:,0][unq1],np.bincount(unq1.cumsum()-1,A[:,1])))
Out[203]: 
array([[ 25.,   1.],
       [ 37.,   5.],
       [ 47.,   1.],
       [ 59.,   2.]])
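
For the structured array from the question, the same diff/cumsum idea can be applied to the fields directly. This is only a sketch: the names keys, first, group_ids and out are illustrative and not part of the answer above, and it assumes the array is non-empty and already sorted by 'recod'. Note that np.bincount with weights accumulates in floating point, so the sums are cast back when they are stored in the integer field.

```python
import numpy as np

# Sample data from the question, sorted by the 'recod' field.
x = np.array([(2, 3), (2, 8), (4, 1)],
             dtype=[('recod', '<u8'), ('count', '<u4')])

keys = x['recod']

# True at the first row of every run of equal keys.
first = np.append(True, keys[1:] != keys[:-1])

# Running count of run starts, shifted so group IDs begin at 0.
group_ids = np.cumsum(first) - 1

# Sum 'count' per group (bincount returns float64 when weights are given).
sums = np.bincount(group_ids, weights=x['count'])

# Build the result with the original dtype and fill it field by field.
out = np.empty(int(first.sum()), dtype=x.dtype)
out['recod'] = keys[first]
out['count'] = sums  # floats are cast to the field's integer type

print(out)
```

Comparing neighbouring keys with != is used here instead of np.diff so that the unsigned key type never has to represent a difference.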

8 Comments

I am getting the following error:
Traceback (most recent call last):
  File "/home/krlk89/abc.py", line 8, in <module>
    unq1 = np.append(True,np.diff(x[:,0])!=0)
IndexError: too many indices
@krlk89 What's the shape of x? x.shape output?
>>> x
array([(2, 3), (2, 8), (4, 1)], dtype=[('recod', '<u8'), ('count', '<u4')])
>>> x.shape
(3,)
@krlk89 You could build A = np.column_stack((x['recod'],x['count'])) and pass A to the solutions as the input. Alternatively, in the solution code, use x['recod'] in place of A[:,0] and x['count'] in place of A[:,1].
I added an answer based off this one, adapted to structured arrays.
2

Divakar's answer cast in structured array form:

In [500]: x=np.array([(25, 1), (37, 3), (37, 2), (47, 1), (59, 2)], dtype=[('recod', '<u8'), ('count', '<u4')])

Find the unique values and sum the counts of the duplicates:

In [501]: unqA, idx=np.unique(x['recod'], return_inverse=True)    
In [502]: cnt = np.bincount(idx, x['count'])

Make a new structured array and fill the fields:

In [503]: x1 = np.empty(unqA.shape, dtype=x.dtype)
In [504]: x1['recod'] = unqA
In [505]: x1['count'] = cnt

In [506]: x1
Out[506]: 
array([(25, 1), (37, 5), (47, 1), (59, 2)], 
      dtype=[('recod', '<u8'), ('count', '<u4')])

There is a recarray function that builds an array from a list of arrays:

In [507]: np.rec.fromarrays([unqA,cnt],dtype=x.dtype)
Out[507]: 
rec.array([(25, 1), (37, 5), (47, 1), (59, 2)], 
      dtype=[('recod', '<u8'), ('count', '<u4')])

Internally it does the same thing: it builds an empty array of the right size and dtype, and then loops over the dtype fields. A recarray is just a structured array in a specialized array subclass wrapper.

There are two ways of populating a structured array (especially with a diverse dtype) - with a list of tuples as you did with x, and field by field.
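
As a small illustration of those two ways, the sketch below builds the same two-record array once from a list of tuples and once field by field. The names dt, by_rows and by_fields are made up for this example.

```python
import numpy as np

dt = np.dtype([('recod', '<u8'), ('count', '<u4')])

# Way 1: a list of tuples, one tuple per record.
by_rows = np.array([(25, 1), (37, 5)], dtype=dt)

# Way 2: allocate the array first, then assign each field as a column.
by_fields = np.zeros(2, dtype=dt)
by_fields['recod'] = [25, 37]
by_fields['count'] = [1, 5]

print(by_rows)
print(by_fields)
```

Note that the records must be tuples; a list of lists is not interpreted as records when a structured dtype is given.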

2 Comments

Thanks a lot for your help!
Thanks for helping out OP on this! I wasn't really familiar with the structured arrays thing.
2

pandas makes this type of "group-by" operation trivial:

In [285]: import pandas as pd

In [286]: x = [(25, 1), (37, 3), (37, 2), (47, 1), (59, 2)]

In [287]: df = pd.DataFrame(x)

In [288]: df
Out[288]: 
    0  1
0  25  1
1  37  3
2  37  2
3  47  1
4  59  2

In [289]: df.groupby(0).sum()
Out[289]: 
    1
0    
25  1
37  5
47  1
59  2

You probably won't want the dependency on pandas if this is the only operation you need from it, but once you get started, you might find other useful bits in the library.

3 Comments

Thanks for your help! I tried this and got an error message: pastebin.com/mA6fDT3u
I see you changed the format of your array. In that case, use df.groupby('recod').sum().
Thanks! It works now, but how can I get back the structure of my initial array?
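
One possible way to get back to the original structure, sketched here under the assumption that the DataFrame was built directly from the structured array (so its columns are named 'recod' and 'count'), is to allocate a new array with the original dtype and copy the grouped columns into its fields. The names summed and out are illustrative.

```python
import numpy as np
import pandas as pd

x = np.array([(25, 1), (37, 3), (37, 2), (47, 1), (59, 2)],
             dtype=[('recod', '<u8'), ('count', '<u4')])

# A structured array becomes a DataFrame with one column per field.
df = pd.DataFrame(x)

# as_index=False keeps 'recod' as a regular column instead of the index.
summed = df.groupby('recod', as_index=False)['count'].sum()

# Copy the result back into an array with the original dtype, field by field.
out = np.empty(len(summed), dtype=x.dtype)
out['recod'] = summed['recod'].to_numpy()
out['count'] = summed['count'].to_numpy()

print(out)
```

Filling the fields explicitly avoids relying on whatever integer types pandas chooses for the summed column.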
1

You can use np.add.reduceat. You just need the indices where x[:, 0] changes, which are the nonzero indices of np.diff(x[:,0]) shifted by one, plus the initial index 0:

>>> i = np.r_[0, 1 + np.nonzero(np.diff(x[:,0]))[0]]
>>> a, b = x[i, 0], np.add.reduceat(x[:, 1], i)
>>> np.vstack((a, b)).T
array([[25,  1],
       [37,  5],
       [47,  1],
       [59,  2]])
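
The snippet above indexes x as a 2d array. For the structured array from the question, a similar sketch can index by field name instead. It assumes a non-empty array sorted by 'recod'; the names keys, starts and out are illustrative.

```python
import numpy as np

x = np.array([(25, 1), (37, 3), (37, 2), (47, 1), (59, 2)],
             dtype=[('recod', '<u8'), ('count', '<u4')])

keys = x['recod']

# Start index of each run of equal keys: index 0 plus every position
# where the key differs from its predecessor.
starts = np.r_[0, 1 + np.nonzero(np.diff(keys))[0]]

# reduceat sums x['count'] over the slices [starts[i], starts[i+1]).
out = np.empty(starts.size, dtype=x.dtype)
out['recod'] = keys[starts]
out['count'] = np.add.reduceat(x['count'], starts)

print(out)
```

Unlike the np.bincount variant, np.add.reduceat keeps the integer dtype of the 'count' field, so no float round trip is involved.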
