I have a numpy array that I need to consolidate by combining the rows with duplicate entries (based on the first column), while preserving any positive values of the other columns. My array looks like this.

array([[117,   0,   1,   0,   0,   0],
       [163,   1,   0,   0,   0,   0],
       [117,   0,   0,   0,   0,   1],
       [120,   0,   1,   0,   0,   0],
       [189,   0,   0,   0,   1,   0],
       [117,   1,   0,   0,   0,   0],
       [120,   0,   0,   1,   0,   0]])

I'm trying to make the output look like this:

array([[117,   1,   1,   0,   0,   1],
       [120,   0,   1,   1,   0,   0],
       [163,   1,   0,   0,   0,   0],
       [189,   0,   0,   0,   1,   0]])

I've been able to use unique on column zero to filter out the duplicates, but I can't seem to preserve the values of the other columns. I would appreciate any input!

  • This is certainly possible (but a little fiddly) in NumPy. Are you open to solutions in other libraries? Commented Jan 4, 2016 at 19:08
  • Yes, I'm open to other libraries. Commented Jan 4, 2016 at 19:11
  • And does the order of the returned rows matter (e.g. can the first column be 117, 120, 163, 189)? Commented Jan 4, 2016 at 19:29
  • The order doesn't matter. Commented Jan 4, 2016 at 19:31

2 Answers

A pure NumPy solution could work like this (I've named your starting array a):

>>> b = a[np.argsort(a[:, 0])]
>>> grps, idx = np.unique(b[:, 0], return_index=True)
>>> counts = np.add.reduceat(b[:, 1:], idx)
>>> np.column_stack((grps, counts))
array([[117,   1,   1,   0,   0,   1],
       [120,   0,   1,   1,   0,   0],
       [163,   1,   0,   0,   0,   0],
       [189,   0,   0,   0,   1,   0]])

This returns the rows in sorted order (by label).
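
Note that `np.add.reduceat` sums within each group, so a column that is positive in two rows sharing a label would come out as 2 rather than 1. If only the presence of a positive value matters, a variation along these lines might work. It is a sketch that swaps in `np.maximum.reduceat`, and the helper name `consolidate_max` is made up for illustration:

```python
import numpy as np

def consolidate_max(a):
    """Merge rows sharing a first-column label, keeping the max of every other column."""
    # A stable sort keeps rows with equal labels in their original relative order.
    b = a[np.argsort(a[:, 0], kind="stable")]
    # idx holds the start offset of each label's block in the sorted array.
    labels, idx = np.unique(b[:, 0], return_index=True)
    # Reduce each block along axis 0 with max instead of sum.
    merged = np.maximum.reduceat(b[:, 1:], idx)
    return np.column_stack((labels, merged))

a = np.array([[117, 0, 1, 0, 0, 0],
              [163, 1, 0, 0, 0, 0],
              [117, 0, 0, 0, 0, 1],
              [120, 0, 1, 0, 0, 0],
              [189, 0, 0, 0, 1, 0],
              [117, 1, 0, 0, 0, 0],
              [120, 0, 0, 1, 0, 0]])
result = consolidate_max(a)
```

With the sample array this gives the same rows as the summing version; the two only differ when a label has the same column set in more than one row.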

A solution in pandas is possible in fewer lines (and potentially uses less additional memory than the NumPy method):

>>> df = pd.DataFrame(a)
>>> df.groupby(0, sort=False, as_index=False).sum().values
array([[117,   1,   1,   0,   0,   1],
       [163,   1,   0,   0,   0,   0],
       [120,   0,   1,   1,   0,   0],
       [189,   0,   0,   0,   1,   0]])

The sort=False parameter means that the rows are returned in the order the unique labels were first encountered.
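
For reference, a self-contained version of the pandas approach might look like the following. It uses `.max()` in place of `.sum()`, which is an assumption about the desired result (flags stay 0/1 even when a label has the same column set in several rows); with the sample data both aggregations give the same array. The variable names `first_seen` and `by_label` are illustrative only.

```python
import numpy as np
import pandas as pd

a = np.array([[117, 0, 1, 0, 0, 0],
              [163, 1, 0, 0, 0, 0],
              [117, 0, 0, 0, 0, 1],
              [120, 0, 1, 0, 0, 0],
              [189, 0, 0, 0, 1, 0],
              [117, 1, 0, 0, 0, 0],
              [120, 0, 0, 1, 0, 0]])

df = pd.DataFrame(a)

# sort=False: groups appear in the order their label was first encountered.
first_seen = df.groupby(0, sort=False, as_index=False).max().to_numpy()

# sort=True (the default): groups appear in ascending label order.
by_label = df.groupby(0, sort=True, as_index=False).max().to_numpy()
```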

2 Comments

Just beat me to the pandas solution. +1
Cheers - yep, I think pandas is often the way to go with these grouping/aggregation problems.

If you don't mind the rows coming back in arbitrary order (dictionaries only guarantee insertion order from Python 3.7 onward), a dictionary keyed on the first column could work.

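def consolidate(rows):
    unique = {}
    for row in rows:
        key = row[0]
        if key not in unique:
            # Copy the row so the caller's data is not modified in place
            unique[key] = list(row)
        else:
            # OR each remaining column into the row already stored for this key
            for i in range(1, len(row)):
                unique[key][i] |= row[i]
    return list(unique.values())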

This results in:

[[120, 0, 1, 1, 0, 0],
 [163, 1, 0, 0, 0, 0],
 [117, 1, 1, 0, 0, 1],
 [189, 0, 0, 0, 1, 0]]

If you do want the row sequence to be preserved, a little more work is needed:

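def consolidate(rows):
    unique = {}

    for sequence, row in enumerate(rows):
        key = row[0]
        # Prepend the input position; list(row) avoids elementwise
        # addition if the row is a NumPy array rather than a list
        row = [sequence] + list(row)
        if key not in unique:
            unique[key] = row
        else:
            # Columns start at index 2: 0 is the sequence, 1 is the key
            for i in range(2, len(row)):
                unique[key][i] |= row[i]
    # Sort by first-seen position, then drop the sequence number
    return [row[1:] for row in sorted(unique.values())]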

This now results in:

[[117, 1, 1, 0, 0, 1],
 [163, 1, 0, 0, 0, 0],
 [120, 0, 1, 1, 0, 0],
 [189, 0, 0, 0, 1, 0]]
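
On Python 3.7 and later, dictionaries keep insertion order, so the explicit sequence number may no longer be needed. A minimal sketch under that assumption (the name `consolidate_ordered` is hypothetical, and the input is assumed to be rows of non-negative integers):

```python
def consolidate_ordered(rows):
    """Merge rows by their first element, OR-ing the remaining columns.

    Relies on dicts preserving insertion order (guaranteed from Python 3.7),
    so groups come back in the order their key was first seen.
    """
    merged = {}
    for row in rows:
        key, *flags = row
        if key not in merged:
            # list(flags) is a fresh list, so the input rows are never mutated.
            merged[key] = list(flags)
        else:
            merged[key] = [old | new for old, new in zip(merged[key], flags)]
    return [[key, *flags] for key, flags in merged.items()]

rows = [[117, 0, 1, 0, 0, 0],
        [163, 1, 0, 0, 0, 0],
        [117, 0, 0, 0, 0, 1],
        [120, 0, 1, 0, 0, 0],
        [189, 0, 0, 0, 1, 0],
        [117, 1, 0, 0, 0, 0],
        [120, 0, 0, 1, 0, 0]]
result = consolidate_ordered(rows)
```

Building a new list for each merge keeps the caller's rows untouched, at the cost of a little extra allocation per duplicate.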
