I have a grouped pandas dataframe. I want to aggregate multiple columns. For each column, there are multiple aggregate functions. This is pretty straightforward. The tricky part is that in each aggregate function, I want to access data in another column.
How would I go about doing this efficiently? Here's the code I already have:
import pandas

data = [
    {'id': 1, 'A': 1, 'B': 1, 'C': 1, 'D': 1, 'E': 1, 'F': 1},
    {'id': 1, 'A': 2, 'B': 2, 'C': 2, 'D': 2, 'E': 2, 'F': 2},
    {'id': 2, 'A': 3, 'B': 3, 'C': 3, 'D': 3, 'E': 3, 'F': 3},
    {'id': 2, 'A': 4, 'B': 4, 'C': 4, 'D': 4, 'E': 4, 'F': 4},
]
df = pandas.DataFrame.from_records(data)
def get_column(column, column_name):
    # The Series handed to an aggregate function keeps the original row
    # labels, so .loc can fetch the matching rows of any other column.
    # (.loc rather than .iloc: the group index holds labels, not positions.)
    return df.loc[column.index, column_name]

def agg_sum_a_b(column_a):
    return column_a.sum() + get_column(column_a, 'B').sum()

def agg_sum_a_b_divide_c(column_a):
    return (column_a.sum() + get_column(column_a, 'B').sum()) / get_column(column_a, 'C').sum()

def agg_sum_d_divide_sum_e_f(column_d):
    return column_d.sum() / (get_column(column_d, 'E').sum() + get_column(column_d, 'F').sum())

def multiply_then_sum(column_e):
    # Row-wise product of E and F within the group, then summed.
    return (column_e * get_column(column_e, 'F')).sum()

df_grouped = df.groupby('id')
df_agg = df_grouped.agg({
    'A': [agg_sum_a_b, agg_sum_a_b_divide_c, 'sum'],
    'D': [agg_sum_d_divide_sum_e_f, 'sum'],
    'E': [multiply_then_sum],
})
This code produces this dataframe:
              A                               D                            E
    agg_sum_a_b agg_sum_a_b_divide_c sum agg_sum_d_divide_sum_e_f sum multiply_then_sum
id
1             6                    2   3                      0.5   3                 5
2            14                    2   7                      0.5   7                25
Am I doing this correctly? Is there a better way? The way I access data from another column inside the aggregate functions feels awkward to me.
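One alternative I've considered (a sketch that should reproduce the numbers above; I haven't measured whether it's faster) is groupby().apply(): each group arrives as a full sub-DataFrame, so the other columns are available directly and the get_column() helper goes away. The names metrics, sum_a and sum_d below are just ones I made up here:

def metrics(group):
    # 'group' is the full sub-DataFrame for one id, so columns B, C, E, F
    # can be read directly alongside A and D.
    a_b = group['A'].sum() + group['B'].sum()
    return pandas.Series({
        'agg_sum_a_b': a_b,
        'agg_sum_a_b_divide_c': a_b / group['C'].sum(),
        'sum_a': group['A'].sum(),
        'agg_sum_d_divide_sum_e_f': group['D'].sum() / (group['E'].sum() + group['F'].sum()),
        'sum_d': group['D'].sum(),
        'multiply_then_sum': (group['E'] * group['F']).sum(),
    })

df_agg = df.groupby('id').apply(metrics)

This gives plain column names instead of the two-level header that agg() produces, and shared subexpressions like the A+B sum are computed once per group rather than once per aggregate function.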
The real data and code I'm using have about 20 columns and around 40 aggregate functions, and there could be hundreds of groups, each with hundreds of rows.
With the real data and aggregate functions, this can take several minutes, which is too slow for my purposes. Is there any way to make it more efficient?
Edit: I'm using Python 3.6 and pandas 0.23.0 btw. Thanks!
Edit 2: Added an example where I don't call sum() on the columns.
Comment: Do you always use sum in your real case? This example could be changed because the functions are quite simple, but maybe the real case is more complex. You call column_a.sum() and always apply sum to each column for each grouped id. If so, for speed reasons you could first do df_grouped = df.groupby('id').sum() and then work on the already-summed values, instead of performing sum three times on column A as in this example. (I know it's not your question, but it could help to see the problem differently.)
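A minimal sketch of that suggestion, assuming the real aggregates reduce, like every example above, to arithmetic on per-group sums (EF is a hypothetical scratch column name). Product-style aggregates get precomputed as columns on the flat frame, everything is summed per group in one vectorized pass, and the metrics are then derived from the summed frame. This replaces one Python callback per group per aggregate with a handful of whole-column operations, which is usually where the minutes go:

# Build product columns once on the flat frame, before grouping.
df['EF'] = df['E'] * df['F']

# A single vectorized pass computes every per-group column sum at once.
sums = df.groupby('id').sum()

# Derive the metrics from the already-summed values.
df_agg = pandas.DataFrame({
    'agg_sum_a_b': sums['A'] + sums['B'],
    'agg_sum_a_b_divide_c': (sums['A'] + sums['B']) / sums['C'],
    'sum_a': sums['A'],
    'agg_sum_d_divide_sum_e_f': sums['D'] / (sums['E'] + sums['F']),
    'sum_d': sums['D'],
    'multiply_then_sum': sums['EF'],
})

Any aggregate that does not decompose into sums (a median, for example) would still need the slower per-group path, but it could be computed separately and joined onto this result.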