4

Hello I have a dataframe such as :

   species family     Events       groups
1     SP1      A      10,22           G1
2     SP1      B          7           G2
3     SP1    C,D  4,5,6,1,3  G3,G4,G5,G6
4     SP2      A      22,10           G1
5     SP2    D,C  6,5,4,3,1  G4,G6,G5,G3
6     SP3      C  4,5,3,6,1     G3,G6,G5
7     SP3      E          7           G2
8     SP3      A         10           G1
9     SP4      C       7,22          G12

and I would like to simply merge row for each where there is at least one duplicated element in each columns (except species).

For instance I will merge the rows :

species family    Events      groups
SP1      A        10,22       G1
species family    Events      groups
SP2      A        22,10       G1
species family    Events      groups
SP3      A        10          G1

into

species      family    Events      groups
SP1,SP2,SP3  A         10,22       G1

SO if I do the same for each row I should get an expected output as :

species      family    Events      groups
SP1,SP2,SP3  A         10,22       G1
SP1,SP3      B,E       7           G2
SP1,SP2,SP3  C,D       1,3,4,5,6   G3,G4,G6,G5
SP4          C         7,22        G12 

Note that SP4 has not been merged with any rows since its group was not present in any other rows.

Does someone have an idea please ? Thank you very much for your help and time

Here is the dataframe in dic format it can helps:

{'species': {1: 'SP1', 2: 'SP1', 3: 'SP1', 4: 'SP2', 5: 'SP2', 6: 'SP3', 7: 'SP3', 8: 'SP3', 9: 'SP4'}, 'family': {1: 'A', 2: 'B', 3: 'C,D', 4: 'A', 5: 'D,C', 6: 'C', 7: 'E', 8: 'A', 9: 'C'}, 'Events': {1: '10,22', 2: '7', 3: '4,5,6,1,3', 4: '22,10', 5: '6,5,4,3,1', 6: '4,5,3,6,1', 7: '7', 8: '10', 9: '7,22'}, 'groups': {1: 'G1', 2: 'G2', 3: 'G3,G4,G5,G6', 4: 'G1', 5: 'G4,G6,G5,G3', 6: 'G3,G6,G5', 7: 'G2', 8: 'G1', 9: 'G12'}}
2
  • 1
    Thanks for posting the example DataFrame in a copy/paste format. That's really nice and happends too rarely. Commented Mar 5, 2021 at 9:50
  • @moritz it is the least I can do eheh Commented Mar 5, 2021 at 10:51

1 Answer 1

1

The trick here is to split the cells on comma, handles them as sets, and finaly join them back. Those are indeed non vectorizable operations, but it leads to (rather) simple code using apply.

I would first compute a containing group for each row:

g = df['groups'].apply(lambda x: set(x.split(',')))   # explode into sets
# keep the larger set from g containing the current one and make it back a string
g2 = g.apply(lambda s: ','.join(sorted(
    g[g.apply(lambda x: x.issuperset(s))].max())))

With the sample, it gives:

1             G1
2             G2
3    G3,G4,G5,G6
4             G1
5    G3,G4,G5,G6
6    G3,G4,G5,G6
7             G2
8             G1
9            G12
Name: groups, dtype: object

We can now use it to groupby the dataframe and aggregate the groups using the set trick:

resul = df[['species', 'family', 'Events']].groupby(g2).agg(
    lambda x: ','.join(sorted(set((i for j in x for i in j.split(',')))))
    ).reset_index().reindex(df.columns)

And we get as expected:

       species family     Events       groups
0  SP1,SP2,SP3      A      10,22           G1
1          SP4      C       22,7          G12
2      SP1,SP3    B,E          7           G2
3  SP1,SP2,SP3    C,D  1,3,4,5,6  G3,G4,G5,G6
Sign up to request clarification or add additional context in comments.

2 Comments

Hello, I tried your code but in resul I only get NaN values
I remove .reindex(df.columns) and i works now

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.