Is there a way in pandas to create a new column that is a function of two columns' aggregations, so that the function is preserved for any arbitrary grouping? This would be functionally similar to creating a calculated column in Excel and pivoting by labels.

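import pandas as pd

df1 = pd.DataFrame({'lab':['lab1','lab2']*5,'A':[1,2]*5,'B':[4,5]*5})
df1['C'] = df1['A'] / df1['B']
# Desired call (does not work as written: the 'C' lambda receives only column C, not the whole group)
pd.pivot_table(df1, index='lab', aggfunc={'A':sum,'B':sum,'C':lambda x: x['A']/x['B']})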

should return:

| lab  | A  | B  | C    |
|------|----|----|------|
| lab1 | 5  | 20 | 0.25 |
| lab2 | 10 | 25 | 0.4  |

I'd like to aggregate by 'lab' (or any combination of labels) and have the dataframe return the aggregation without having to redefine the column calculation. I realize this is trivial to code manually, but it's repetitive when you have many columns.

    Would you mind posting your expected result? It isn't clear from your question. Commented Apr 9, 2018 at 21:09

1 Answer

There are two ways you can do this using apply or agg:

import numpy as np
import pandas as pd

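# Method 1: apply builds one output row per group.
# C takes the first unique per-row ratio, so this assumes A/B is constant within each group.
df1.groupby('lab').apply(lambda df: pd.Series({'A': df['A'].sum(), 'B': df['B'].sum(), 'C': df['C'].unique()[0]})).reset_index()

# Method 2: agg with one function per column (same assumption about C).
# The [0] makes the lambda return a scalar rather than an array.
df1.groupby('lab').agg({'A': 'sum',
                        'B': 'sum',
                        'C': lambda x: np.unique(x)[0]}).reset_index()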

# output
     lab  A    B   C
0   lab1  5    20 0.25
1   lab2  10   25 0.40
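
Both methods above pick C from the per-row ratios, which only matches the ratio of the sums when A/B is the same for every row in a group. A minimal sketch of the difference, using made-up data (not from the question) where the ratio varies within each group:

```python
import pandas as pd

# Illustrative data (an assumption, not the question's): A/B varies within each lab.
df = pd.DataFrame({'lab': ['lab1', 'lab1', 'lab2', 'lab2'],
                   'A': [1, 3, 2, 6],
                   'B': [4, 4, 5, 10]})
df['C'] = df['A'] / df['B']

grouped = df.groupby('lab')

# What "take the first unique ratio" would report per group.
first_ratio = grouped['C'].first()

# What a calculated pivot column reports: sum(A) / sum(B) per group.
ratio_of_sums = grouped['A'].sum() / grouped['B'].sum()

print(pd.DataFrame({'first_ratio': first_ratio, 'ratio_of_sums': ratio_of_sums}))
```

For lab1 the first per-row ratio is 1/4 while the ratio of sums is 4/8, so C has to be recomputed from the aggregated A and B rather than aggregated itself.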

5 Comments

Thanks! Could you please show how to use agg() so that it references specific columns? I don't think this solution will work for other examples, such as df1 = pd.DataFrame({'lab':['lab1','lab2']*5,'A':np.random.randint(1,11,10),'B':np.random.randint(1,11,10)})
In agg, we pass a dictionary. The keys of the dictionary are the column names from df and the values are the functions to apply. A function can be passed as a string or as a custom lambda. It should work on your new example too.
Got it. This produces the output I want: df1.groupby('lab').apply(lambda df: pd.Series({'A': df['A'].sum(), 'B': df['B'].sum(), 'C': df['A'].sum()/df['B'].sum()})).reset_index() However, I'm pretty unclear what pandas is actually doing here.
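
On what pandas is doing there: groupby splits the frame into one sub-DataFrame per label, apply calls the function once on each piece, and the Series returned for each piece become the rows of the result. A rough hand-written equivalent, as a sketch only (the names summarize and manual are made up for illustration, and pandas' real apply handles more cases than this):

```python
import pandas as pd

df1 = pd.DataFrame({'lab': ['lab1', 'lab2'] * 5,
                    'A': [1, 2] * 5,
                    'B': [4, 5] * 5})

def summarize(group):
    # `group` is the sub-DataFrame holding the rows of one lab.
    a, b = group['A'].sum(), group['B'].sum()
    return pd.Series({'A': a, 'B': b, 'C': a / b})

# Iterating a groupby yields (label, sub-DataFrame) pairs.
rows = {label: summarize(group) for label, group in df1.groupby('lab')}

# Each returned Series becomes one row, labelled by its group key.
manual = pd.DataFrame(rows).T
manual.index.name = 'lab'
manual = manual.reset_index()
print(manual)
```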
@Jonathan umm... are you after: df1.groupby('lab', as_index=False).sum().assign(C=lambda L: L.A / L.B) ?
Yes, partially. Like I said in my description, I know it is possible to do it like this. I am looking for a practical way to encapsulate the aggregation logic in the column itself. I don't want to have to re-assign variables after each transformation. Ideally, the dataframe would reference that lambda function defined elsewhere and use the generic .agg() in one go, without using apply.
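
A pandas column stores values rather than a formula, so one possible workaround is to keep the formulas in a small spec next to the data and reuse it for every grouping. This is only a sketch under that assumption: DERIVED, summarize, and the extra 'site' column are hypothetical names introduced for illustration, not part of pandas or of the question.

```python
import pandas as pd

# Hypothetical spec: each derived column is defined once,
# in terms of the already-aggregated columns.
DERIVED = {'C': lambda d: d['A'] / d['B']}

def summarize(df, by, sums=('A', 'B'), derived=DERIVED):
    # Sum the raw columns for whatever grouping is requested...
    totals = df.groupby(by, as_index=False)[list(sums)].sum()
    # ...then recompute every derived column from the aggregates.
    return totals.assign(**derived)

df1 = pd.DataFrame({'lab': ['lab1', 'lab2'] * 5,
                    'site': ['x'] * 4 + ['y'] * 6,  # extra label, for illustration only
                    'A': [1, 2] * 5,
                    'B': [4, 5] * 5})

by_lab = summarize(df1, 'lab')
by_lab_site = summarize(df1, ['lab', 'site'])
print(by_lab)
print(by_lab_site)
```

The formula for C is written once and applied after the sum, so changing the grouping keys does not require redefining it.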
