Consolidate duplicate rows of an array

Question

I have a numpy array that I need to consolidate by combining the rows with duplicate entries (based on the first column), while preserving any positive values of the other columns. My array looks like this.

array([[117,   0,   1,   0,   0,   0],
       [163,   1,   0,   0,   0,   0],
       [117,   0,   0,   0,   0,   1],
       [120,   0,   1,   0,   0,   0],
       [189,   0,   0,   0,   1,   0],
       [117,   1,   0,   0,   0,   0],
       [120,   0,   0,   1,   0,   0]])

I'm trying to make the output look like this:

array([[117,   1,   1,   0,   0,   1],
       [120,   0,   1,   1,   0,   0],
       [163,   1,   0,   0,   0,   0],
       [189,   0,   0,   0,   1,   0]])

I've been able to use unique on column zero to filter out the duplicates, but I can't seem to preserve the values of the other columns. I would appreciate any input!

This is certainly possible (but a little fiddly) in NumPy. Are you open to solutions in other libraries?
– Alex Riley, Commented Jan 4, 2016 at 19:08
And does the order of the returned rows matter (e.g. can the first column be 117, 120, 163, 189)?
– Alex Riley, Commented Jan 4, 2016 at 19:29

Alex Riley · Accepted Answer · 2016-01-04 19:34:06Z

A pure NumPy solution could work like this (I've named your starting array a):

>>> import numpy as np
>>> b = a[np.argsort(a[:, 0])]
>>> grps, idx = np.unique(b[:, 0], return_index=True)
>>> counts = np.add.reduceat(b[:, 1:], idx)
>>> np.column_stack((grps, counts))
array([[117,   1,   1,   0,   0,   1],
       [120,   0,   1,   1,   0,   0],
       [163,   1,   0,   0,   0,   0],
       [189,   0,   0,   0,   1,   0]])

This returns the rows in sorted order (by label).
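
Note that np.add.reduceat sums the columns, so a flag that is set in two rows with the same label would come out as 2 rather than 1. If the columns are meant to stay 0/1, one possible variant (a sketch, not part of the original answer) swaps in np.maximum.reduceat. The function name consolidate_flags and the small test array are made up for illustration.

```python
import numpy as np

def consolidate_flags(a):
    """Merge rows sharing a first-column label, keeping 0/1 flags as 0/1."""
    # Stable sort so rows with equal labels keep their original relative order.
    b = a[np.argsort(a[:, 0], kind="stable")]
    # starts[i] is the index in b where the i-th unique label first appears.
    labels, starts = np.unique(b[:, 0], return_index=True)
    # Column-wise maximum within each block of equal labels (instead of a sum).
    merged = np.maximum.reduceat(b[:, 1:], starts, axis=0)
    return np.column_stack((labels, merged))

# Hypothetical input: label 7 has the first flag set in both of its rows.
a = np.array([[7, 1, 0],
              [3, 0, 1],
              [7, 1, 1]])
result = consolidate_flags(a)
print(result)
```

With a sum, the row for label 7 would read [7, 2, 1]; with the maximum it stays [7, 1, 1].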

A solution in pandas is possible in fewer lines (and potentially uses less additional memory than the NumPy method):

>>> import pandas as pd
>>> df = pd.DataFrame(a)
>>> df.groupby(0, sort=False, as_index=False).sum().values
array([[117,   1,   1,   0,   0,   1],
       [163,   1,   0,   0,   0,   0],
       [120,   0,   1,   1,   0,   0],
       [189,   0,   0,   0,   1,   0]])

The sort=False parameter means that the rows are returned in the order the unique labels were first encountered.
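
To make the effect of sort concrete, here is a small self-contained sketch that runs the same groupby both ways on the array from the question. The variable names first_seen and by_label are made up for illustration, and to_numpy() is used in place of .values, which should give the same array here.

```python
import numpy as np
import pandas as pd

# The array from the question.
a = np.array([[117, 0, 1, 0, 0, 0],
              [163, 1, 0, 0, 0, 0],
              [117, 0, 0, 0, 0, 1],
              [120, 0, 1, 0, 0, 0],
              [189, 0, 0, 0, 1, 0],
              [117, 1, 0, 0, 0, 0],
              [120, 0, 0, 1, 0, 0]])
df = pd.DataFrame(a)

# sort=False: groups appear in the order their label was first encountered.
first_seen = df.groupby(0, sort=False, as_index=False).sum().to_numpy()
# sort=True (the default): groups are ordered by label.
by_label = df.groupby(0, sort=True, as_index=False).sum().to_numpy()

print(first_seen[:, 0])  # labels in first-seen order
print(by_label[:, 0])    # labels in sorted order
```

Only the row order differs between the two results; the consolidated values for each label are the same.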

2 Comments

Padraic Cunningham Over a year ago

Just beat me to the pandas solution. +1

Alex Riley Over a year ago

Cheers - yep, I think pandas is often the way to go with these grouping/aggregation problems.

Fred Truter · Accepted Answer · 2016-01-04 19:30:35Z

If you don't mind the rows coming back in arbitrary order, a dictionary keyed on the first column could work. (Plain dicts only preserve insertion order from Python 3.7 onward.)

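def consolidate(rows):
    unique = {}
    for row in rows:
        key = row[0]
        if key not in unique:
            # Copy the row so the caller's data is not modified in place.
            unique[key] = list(row)
        else:
            # OR the flag columns into the row already stored for this key.
            for i in range(1, len(row)):
                unique[key][i] |= row[i]
    return list(unique.values())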

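This results in (the row order depends on the dict implementation; on Python 3.7+ the rows come back in first-seen order instead):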

[[120, 0, 1, 1, 0, 0],
 [163, 1, 0, 0, 0, 0],
 [117, 1, 1, 0, 0, 1],
 [189, 0, 0, 0, 1, 0]]

If you do want the row sequence to be preserved, a little more work is needed:

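def consolidate(rows):
    unique = {}
    for sequence, row in enumerate(rows):
        key = row[0]
        # Prepend the position at which the row was seen. list() copies the
        # row and keeps this working when the rows are NumPy arrays, where
        # [sequence] + row would add element-wise instead of concatenating.
        row = [sequence] + list(row)
        if key not in unique:
            unique[key] = row
        else:
            # Flag columns start at index 2 because of the prepended position.
            for i in range(2, len(row)):
                unique[key][i] |= row[i]
    # Sort by first-seen position, then drop that helper column again.
    return [row[1:] for row in sorted(unique.values())]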

This now results in:

[[117, 1, 1, 0, 0, 1],
 [163, 1, 0, 0, 0, 0],
 [120, 0, 1, 1, 0, 0],
 [189, 0, 0, 0, 1, 0]]
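
On Python 3.7 and later, plain dicts keep insertion order, so the sequence counter is arguably not needed to get first-seen ordering. A minimal sketch under that assumption (the name consolidate_ordered is made up here, and the rows are assumed to hold integer 0/1 flags after the first column):

```python
def consolidate_ordered(rows):
    """Merge rows by their first element, keeping first-seen order."""
    merged = {}
    for key, *flags in rows:
        if key in merged:
            # OR each flag with the value already stored for this key.
            merged[key] = [old | new for old, new in zip(merged[key], flags)]
        else:
            # flags is a fresh list, so the input rows are never modified.
            merged[key] = flags
    # Dict iteration follows insertion order on Python 3.7+.
    return [[key, *flags] for key, flags in merged.items()]

rows = [[117, 0, 1, 0, 0, 0],
        [163, 1, 0, 0, 0, 0],
        [117, 0, 0, 0, 0, 1],
        [120, 0, 1, 0, 0, 0],
        [189, 0, 0, 0, 1, 0],
        [117, 1, 0, 0, 0, 0],
        [120, 0, 0, 1, 0, 0]]
print(consolidate_ordered(rows))
```

Because each merge builds a new list, the caller's rows are left untouched, which is not the case when the stored row is updated in place.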
