Merge rows with duplications in pandas

Question

Hello I have a dataframe such as :

   species family     Events       groups
1     SP1      A      10,22           G1
2     SP1      B          7           G2
3     SP1    C,D  4,5,6,1,3  G3,G4,G5,G6
4     SP2      A      22,10           G1
5     SP2    D,C  6,5,4,3,1  G4,G6,G5,G3
6     SP3      C  4,5,3,6,1     G3,G6,G5
7     SP3      E          7           G2
8     SP3      A         10           G1
9     SP4      C       7,22          G12

and I would like to simply merge row for each where there is at least one duplicated element in each columns (except species).

For instance I will merge the rows :

species family    Events      groups
SP1      A        10,22       G1
species family    Events      groups
SP2      A        22,10       G1
species family    Events      groups
SP3      A        10          G1

into

species      family    Events      groups
SP1,SP2,SP3  A         10,22       G1

SO if I do the same for each row I should get an expected output as :

species      family    Events      groups
SP1,SP2,SP3  A         10,22       G1
SP1,SP3      B,E       7           G2
SP1,SP2,SP3  C,D       1,3,4,5,6   G3,G4,G6,G5
SP4          C         7,22        G12

Note that SP4 has not been merged with any rows since its group was not present in any other rows.

Does someone have an idea please ? Thank you very much for your help and time

Here is the dataframe in dic format it can helps:

{'species': {1: 'SP1', 2: 'SP1', 3: 'SP1', 4: 'SP2', 5: 'SP2', 6: 'SP3', 7: 'SP3', 8: 'SP3', 9: 'SP4'}, 'family': {1: 'A', 2: 'B', 3: 'C,D', 4: 'A', 5: 'D,C', 6: 'C', 7: 'E', 8: 'A', 9: 'C'}, 'Events': {1: '10,22', 2: '7', 3: '4,5,6,1,3', 4: '22,10', 5: '6,5,4,3,1', 6: '4,5,3,6,1', 7: '7', 8: '10', 9: '7,22'}, 'groups': {1: 'G1', 2: 'G2', 3: 'G3,G4,G5,G6', 4: 'G1', 5: 'G4,G6,G5,G3', 6: 'G3,G6,G5', 7: 'G2', 8: 'G1', 9: 'G12'}}

Thanks for posting the example DataFrame in a copy/paste format. That's really nice and happends too rarely. — moritz
– moritz, Commented Mar 5, 2021 at 9:50

Serge Ballesta · Accepted Answer · 2021-03-05 11:09:26Z

1

The trick here is to split the cells on comma, handles them as sets, and finaly join them back. Those are indeed non vectorizable operations, but it leads to (rather) simple code using apply.

I would first compute a containing group for each row:

g = df['groups'].apply(lambda x: set(x.split(',')))   # explode into sets
# keep the larger set from g containing the current one and make it back a string
g2 = g.apply(lambda s: ','.join(sorted(
    g[g.apply(lambda x: x.issuperset(s))].max())))

With the sample, it gives:

1             G1
2             G2
3    G3,G4,G5,G6
4             G1
5    G3,G4,G5,G6
6    G3,G4,G5,G6
7             G2
8             G1
9            G12
Name: groups, dtype: object

We can now use it to groupby the dataframe and aggregate the groups using the set trick:

resul = df[['species', 'family', 'Events']].groupby(g2).agg(
    lambda x: ','.join(sorted(set((i for j in x for i in j.split(',')))))
    ).reset_index().reindex(df.columns)

And we get as expected:

       species family     Events       groups
0  SP1,SP2,SP3      A      10,22           G1
1          SP4      C       22,7          G12
2      SP1,SP3    B,E          7           G2
3  SP1,SP2,SP3    C,D  1,3,4,5,6  G3,G4,G5,G6

answered Mar 5, 2021 at 11:09

Serge Ballesta

150k13 gold badges137 silver badges267 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

chippycentra Over a year ago

Hello, I tried your code but in resul I only get NaN values

chippycentra Over a year ago

I remove .reindex(df.columns) and i works now

Collectives™ on Stack Overflow

Merge rows with duplications in pandas

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related