Pandas create a custom groupby aggregation for column

Question

Is there a way in Pandas to create a new column that is a function of two columns' aggregations, so that the function is preserved for any arbitrary grouping? This would be functionally similar to creating a calculated column in Excel and pivoting by labels.

import pandas as pd

df1 = pd.DataFrame({'lab':['lab1','lab2']*5,'A':[1,2]*5,'B':[4,5]*5})
df1['C'] = df1['A'] / df1['B']
# Desired, but does not work: the aggfunc for 'C' receives only column C, not A and B
pd.pivot_table(df1,index='lab',aggfunc={'A':'sum','B':'sum','C':lambda x: x['A']/x['B']})

should return:

| lab  | A  | B  | C    |
|------|----|----|------|
| lab1 | 5  | 20 | 0.25 |
| lab2 | 10 | 25 | 0.4  |

I'd like to aggregate by 'lab' (or any combination of labels) and have the dataframe return the aggregation without having to redefine the column calculation. I realize this is trivial to code manually, but it gets repetitive when you have many columns.

Would you mind posting your expected result? It isn't clear from your question.
– cs95, Commented Apr 9, 2018 at 21:09

YOLO · Accepted Answer · 2018-04-09 22:01:41Z

4

There are two ways you can do this using apply or agg:

import numpy as np
import pandas as pd

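# Method 1: apply a function to each group's sub-DataFrame.
# Taking the first unique value of C assumes A/B is the same for every row in a group.
df1.groupby('lab').apply(lambda df: pd.Series({'A': df['A'].sum(), 'B': df['B'].sum(), 'C': df['C'].unique()[0]})).reset_index()

# Method 2: agg with one aggregation per column (same assumption about C).
# [0] returns the single distinct value as a scalar rather than an array.
df1.groupby('lab').agg({'A': 'sum',
                    'B': 'sum',
                    'C': lambda x: np.unique(x)[0]}).reset_index()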

# output
    lab   A   B     C
0  lab1   5  20  0.25
1  lab2  10  25  0.40

edited Apr 9, 2018 at 22:01

answered Apr 9, 2018 at 21:51

5 Comments

Jonathan Over a year ago

Thanks! Could you please show how to use agg() with a function that references specific columns? I don't think this solution will work for other examples, such as df1 = pd.DataFrame({'lab':['lab1','lab2']*5,'A':np.random.randint(1,11,10),'B':np.random.randint(1,11,10)})

YOLO Over a year ago

In agg, we pass a dictionary. The keys of the dictionary are the column names from df and the values are the functions to apply. A function can be passed as a string or as a custom lambda. It should work on your new example too.
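
A minimal sketch of the dictionary form described in this comment, using made-up data rather than the question's. Each function in the dictionary is handed only its own column, one group at a time, which is why a lambda stored under the key 'C' cannot look up 'A' or 'B'. The names total and seen are illustrative only.

```python
import pandas as pd

# Illustrative data only; these values are not from the original question.
df = pd.DataFrame({'lab': ['lab1', 'lab2'] * 2,
                   'A': [1, 2, 3, 4],
                   'B': [4, 5, 6, 5]})

seen = []  # records whether each call received a Series

def total(col):
    # col holds the values of one column ('B' here) for a single group
    seen.append(isinstance(col, pd.Series))
    return col.sum()

# A string names a built-in aggregation; a callable is applied per group.
out = df.groupby('lab').agg({'A': 'sum', 'B': total}).reset_index()
print(out)
print(all(seen))
```

Because the callable never sees the other columns, a ratio of two sums has to be computed either from the whole group (apply) or after the aggregation.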

Jonathan Over a year ago

Got it. This produces the output I want: df1.groupby('lab').apply(lambda df: pd.Series({'A': df['A'].sum(), 'B': df['B'].sum(), 'C': df['A'].sum()/df['B'].sum()})).reset_index() However, I'm pretty unclear on what Pandas is actually doing here.
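
A rough sketch of what that apply call does, under the assumption that it helps to think of it as a loop over groups: the function receives each group's sub-DataFrame, and the Series it returns becomes one row of the result. The data and the name summarize are illustrative, not from the thread.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({'lab': ['lab1', 'lab2'] * 2,
                   'A': [1, 2, 3, 4],
                   'B': [4, 5, 6, 5]})

def summarize(group):
    # group is the sub-DataFrame of rows that share one 'lab' value
    a, b = group['A'].sum(), group['B'].sum()
    return pd.Series({'A': a, 'B': b, 'C': a / b})

# Conceptual equivalent: call the function once per group, stack the rows.
rows = {key: summarize(group) for key, group in df.groupby('lab')}
manual = pd.DataFrame(rows).T.rename_axis('lab').reset_index()

# The same computation through groupby.apply; selecting ['A', 'B'] keeps
# the grouping column out of the frames passed to the function.
via_apply = df.groupby('lab')[['A', 'B']].apply(summarize).reset_index()

print(manual)
print(via_apply)
```

Since the whole group is available inside the function, the ratio can be taken from the two sums directly instead of from the precomputed per-row C.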

Jon Clements Over a year ago

@Jonathan umm... are you after: df1.groupby('lab', as_index=False).sum().assign(C=lambda L: L.A / L.B) ?

Jonathan Over a year ago

Yes, partially. As I said in my description, I know it is possible to do it like this. I am looking for a practical way to encapsulate the aggregation logic in the column itself. I don't want to have to re-assign variables after each transformation. Ideally, the dataframe would reference a lambda function defined elsewhere and use the generic .agg() in one go, without using apply.
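
One possible way to keep the formula in a single place is sketched here. DERIVED and aggregate_with_derived are hypothetical names, not part of pandas, and the sketch assumes that every base column is summed and that each derived column can be recomputed from the summed values. It still relies on assign internally, so it is not literally a single .agg() call, but the definition of C is written once and reused for any grouping.

```python
import pandas as pd

# Hypothetical registry: derived column name -> function of the aggregated frame.
DERIVED = {'C': lambda d: d['A'] / d['B']}

def aggregate_with_derived(df, by, base=('A', 'B'), derived=DERIVED):
    """Sum the base columns per group, then recompute each derived column."""
    summed = df.groupby(by, as_index=False)[list(base)].sum()
    # assign calls each function with the aggregated DataFrame
    return summed.assign(**derived)

# Data from the question.
df1 = pd.DataFrame({'lab': ['lab1', 'lab2'] * 5, 'A': [1, 2] * 5, 'B': [4, 5] * 5})

result = aggregate_with_derived(df1, 'lab')
print(result)
```

The by argument is passed straight to groupby, so a list of label columns should work the same way as a single label.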
