Pandas Dataframe Aggregation

Question

I have the following dataframe (I didn't include an index here, but obviously there is one):

ID_1	ID_2	Count
55	62	1000
62	55	1200
...	...	...

Now I would like to aggregate those two columns, since I do not care if the ID is in the column ID_1 or in ID_2.

I would like to get the following result:

ID_1	ID_2	Count
55	62	2200
62	55	2200
...	...	...

That means I want to sum the Count column over all rows in my dataframe where the two IDs are the same (it doesn't matter whether they are in the ID_1 column or the ID_2 column).

I thought about grouping the dataframe, but that did not work properly.

I am happy for any help!

Vishnudev Krishnadas · Accepted Answer · 2021-12-19 12:56:58Z

2

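Sort the ID columns row-wise (note that this overwrites the original order of each ID pair):

import numpy as np
df[['ID_1', 'ID_2']] = np.sort(df[['ID_1', 'ID_2']], axis=1)

Now group by the ID columns and write the per-group sum back to Count:

df['Count'] = df.groupby(['ID_1', 'ID_2'])['Count'].transform('sum')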
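
If the original ID order has to survive, as in the result shown in the question, the same sort-then-group idea can be applied to a temporary key instead of the columns themselves. The following is a minimal sketch under that assumption; the sample data and the variable name `pair` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question.
df = pd.DataFrame({"ID_1": [55, 62], "ID_2": [62, 55], "Count": [1000, 1200]})

# Row-wise sorted copy of each ID pair; the original columns stay untouched.
pair = np.sort(df[["ID_1", "ID_2"]].to_numpy(), axis=1)

# Group on the two sorted key arrays and broadcast each group's sum back to its rows.
df["Count"] = df.groupby([pair[:, 0], pair[:, 1]])["Count"].transform("sum")
print(df)
```

Passing arrays to groupby keeps the key out of the frame, so no helper columns need to be dropped afterwards.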


Corralien · Accepted Answer · 2021-12-19 12:50:56Z

1

Create virtual groups:

make_group = lambda x: tuple(sorted(x))

df['Count'] = df.groupby(df[['ID_1', 'ID_2']].apply(make_group, axis=1))['Count'] \
                .transform('sum')

Output:

>>> df
   ID_1  ID_2  Count
0    55    62   2200
1    62    55   2200

# virtual groups
>>> df[['ID_1', 'ID_2']].apply(make_group, axis=1)
0    (55, 62)
1    (55, 62)
dtype: object


1 Comment

Vishnudev Krishnadas Over a year ago

HINT: I agree with the approach, but this will be slower for larger datasets.
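
One way to avoid the per-row Python call that this comment is concerned about is to build the unordered key from element-wise minimum and maximum. This is a sketch only: the names `lo` and `hi` and the extra sample rows are assumptions for illustration, and no timing was measured here.

```python
import numpy as np
import pandas as pd

# Sample data: the two rows from the question plus a second, hypothetical pair.
df = pd.DataFrame({
    "ID_1": [55, 62, 7, 7],
    "ID_2": [62, 55, 9, 9],
    "Count": [1000, 1200, 5, 10],
})

# (min, max) of each row identifies the pair regardless of column order.
lo = np.minimum(df["ID_1"].to_numpy(), df["ID_2"].to_numpy())
hi = np.maximum(df["ID_1"].to_numpy(), df["ID_2"].to_numpy())

# Sum Count within each unordered pair and broadcast it back to every row.
df["Count"] = df.groupby([lo, hi])["Count"].transform("sum")
print(df)
```

This only works for exactly two ID columns; with more columns, a row-wise np.sort over the ID columns generalizes better.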

wwnde · Accepted Answer · 2021-12-19 13:29:52Z

0

Sort the ID values row-wise with np.sort, then group by the sorted pair and sum Count. Code below:

ids = np.sort(df[['ID_1', 'ID_2']].to_numpy(), axis=1)  # sort only the ID columns, not Count
df = df.assign(Count=df.groupby([ids[:, 0], ids[:, 1]])['Count'].transform('sum'))

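Alternatively, sort the ID columns in place and then group by them (np.sort is called directly here rather than through agg('sort')):

id_cols = df.filter(regex='^ID').columns
df[id_cols] = np.sort(df[id_cols].to_numpy(), axis=1)
df['Count'] = df.groupby(list(id_cols))['Count'].transform('sum')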



    ID_1  ID_2  Count
0    55    62   2200
1    62    55   2200
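
A caveat when sorting row values for this purpose: if the sort also covers the Count column, a Count smaller than the IDs gets shuffled into the ID positions and the grouping goes wrong. The sketch below uses made-up counts (10 and 20) to show the difference; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same ID pairs as the question, but with counts smaller than the IDs.
df = pd.DataFrame({"ID_1": [55, 62], "ID_2": [62, 55], "Count": [10, 20]})

# Sorting every column row-wise mixes Count into the ID columns.
all_sorted = pd.DataFrame(np.sort(df.to_numpy(), axis=1), columns=df.columns)
wrong = all_sorted.groupby(["ID_1", "ID_2"])["Count"].transform("sum")

# Sorting only the ID columns keeps Count where it belongs.
ids = np.sort(df[["ID_1", "ID_2"]].to_numpy(), axis=1)
right = df.groupby([ids[:, 0], ids[:, 1]])["Count"].transform("sum")

print(wrong.tolist())  # [62, 62]
print(right.tolist())  # [30, 30]
```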

edited Dec 19, 2021 at 13:29

answered Dec 19, 2021 at 13:07
