
I have a grouped pandas dataframe. I want to aggregate multiple columns. For each column, there are multiple aggregate functions. This is pretty straightforward. The tricky part is that in each aggregate function, I want to access data in another column.

How would I go about doing this efficiently? Here's the code I already have:

import pandas

data = [
    {
        'id': 1,
        'A': 1,
        'B': 1,
        'C': 1,
        'D': 1,
        'E': 1,
        'F': 1,
    },
    {
        'id': 1,
        'A': 2,
        'B': 2,
        'C': 2,
        'D': 2,
        'E': 2,
        'F': 2,
    },
    {
        'id': 2,
        'A': 3,
        'B': 3,
        'C': 3,
        'D': 3,
        'E': 3,
        'F': 3,
    },
    {
        'id': 2,
        'A': 4,
        'B': 4,
        'C': 4,
        'D': 4,
        'E': 4,
        'F': 4,
    },
]

df = pandas.DataFrame.from_records(data)


def get_column(column, column_name):
    return df.iloc[column.index][column_name]


def agg_sum_a_b(column_a):
    return column_a.sum() + get_column(column_a, 'B').sum()


def agg_sum_a_b_divide_c(column_a):
    return (column_a.sum() + get_column(column_a, 'B').sum()) / get_column(column_a, 'C').sum()


def agg_sum_d_divide_sum_e_f(column_d):
    return column_d.sum() / (get_column(column_d, 'E').sum() + get_column(column_d, 'F').sum())


def multiply_then_sum(column_e):
    return (column_e * get_column(column_e, 'F')).sum()


df_grouped = df.groupby('id')
df_agg = df_grouped.agg({
    'A': [agg_sum_a_b, agg_sum_a_b_divide_c, 'sum'],
    'D': [agg_sum_d_divide_sum_e_f, 'sum'],
    'E': [multiply_then_sum]
})

This code produces this dataframe:

             A                                                 D                     E    
   agg_sum_a_b agg_sum_a_b_divide_c sum agg_sum_d_divide_sum_e_f sum multiply_then_sum
id                                                                  
1            6                    2   3                      0.5   3                 5
2           14                    2   7                      0.5   7                25

Am I doing this correctly? Is there a better way of doing this? I find the way I access data in another column within the aggregate function a little awkward.

The real data and code I'm using has about 20 columns and around 40 aggregate functions. There could potentially be hundreds of groups as well with each group having hundreds of rows.

When I do this using the real data and aggregate functions, it can take several minutes, which is too slow for my purposes. Is there any way to make this more efficient?

Edit: I'm using Python 3.6 and pandas 0.23.0 btw. Thanks!
Edit 2: Added an example where I don't call sum() on the columns.

  • Do you always perform a sum in your real case? This example can be changed because the functions are quite simple, but maybe the real case is more complex. Commented Jul 16, 2018 at 20:26
  • @Ben.T What do you mean? My example aggregate functions are a little simplified, but the real aggregate functions are always some combination of mathematical functions (+, -, /, *, sin, cos, log, exp) over multiple columns. Commented Jul 16, 2018 at 20:34
  • I mean: you do, for example, column_a.sum(), and you always use sum on each column for each grouped id. If so, for speed reasons, you could first do df_grouped = df.groupby('id').sum() and then work on the already-summed values, instead of performing sum three times on column A as in this example (I know it's not your question, but it could help to see the problem differently). Commented Jul 16, 2018 at 20:40
  • @Ben.T Ah, I see. For some aggregate functions yes, for other aggregate functions no (there is one aggregate function that multiplies series together before summing), but this is a good idea. Thanks! I'm still open to other suggestions too! Commented Jul 16, 2018 at 20:42
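The sum-first idea from the comments can be sketched as follows: sum every column once per group, then combine those sums with vectorized arithmetic instead of re-summing inside each aggregate function. For row-wise products like E * F, a helper column is computed before grouping. (This is a sketch of the comment's suggestion, not code from the thread; the column name E_times_F is made up here.)

```python
import pandas as pd

data = [
    {'id': 1, 'A': 1, 'B': 1, 'C': 1, 'D': 1, 'E': 1, 'F': 1},
    {'id': 1, 'A': 2, 'B': 2, 'C': 2, 'D': 2, 'E': 2, 'F': 2},
    {'id': 2, 'A': 3, 'B': 3, 'C': 3, 'D': 3, 'E': 3, 'F': 3},
    {'id': 2, 'A': 4, 'B': 4, 'C': 4, 'D': 4, 'E': 4, 'F': 4},
]
df = pd.DataFrame.from_records(data)

# Row-wise products must be computed before grouping, as a helper column.
df['E_times_F'] = df['E'] * df['F']

# Sum every column exactly once per group.
sums = df.groupby('id').sum()

# Now every aggregate is plain vectorized arithmetic over the group sums.
df_agg = pd.DataFrame(index=sums.index)
df_agg['agg_sum_a_b'] = sums['A'] + sums['B']
df_agg['agg_sum_a_b_divide_c'] = (sums['A'] + sums['B']) / sums['C']
df_agg['sum_a'] = sums['A']
df_agg['agg_sum_d_divide_sum_e_f'] = sums['D'] / (sums['E'] + sums['F'])
df_agg['multiply_then_sum'] = sums['E_times_F']
```

This produces the same numbers as the original agg call but performs each group-wise sum only once, which should help when there are hundreds of groups and dozens of aggregate functions.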

1 Answer


First, I think you need apply rather than agg here, so you can access several columns at once. Here is an idea of how to restructure this slightly: first create a single function that performs all the operations you want and returns them as a list of results:

import pandas as pd


def operations_to_perform(df_g):
    # sum each column of the group once (the same works for mean, min, max, ...)
    df_g_sum = df_g.sum()

    # return all the aggregates you want as a list
    return [df_g_sum['A'] + df_g_sum['B'],
            (df_g_sum['A'] + df_g_sum['B']) / df_g_sum['C'],
            df_g_sum['A'],
            float(df_g_sum['D']) / (df_g_sum['E'] + df_g_sum['F']),
            (df_g['E'] * df_g['F']).sum()]


# use apply to create a Series with id as index and a list of aggregates per group
df_values = df.groupby('id').apply(operations_to_perform)

# now create the result dataframe from df_values with tolist() and its index
df_agg = pd.DataFrame(df_values.tolist(), index=df_values.index,
                      columns=pd.MultiIndex.from_arrays(
                          [['A'] * 3 + ['D'] + ['E'],
                           ['agg_sum_a_b', 'agg_sum_a_b_div_c', 'sum',
                            'agg_sum_d_div_sum_e_f', 'e_mult_f']]))

and df_agg looks like:

             A                                           D        E
   agg_sum_a_b agg_sum_a_b_div_c sum agg_sum_d_div_sum_e_f e_mult_f
id                                                                 
1            6                 2   3                   0.5        5
2           14                 2   7                   0.5       25
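If the two-level column index is awkward to work with downstream, it can be flattened into single names. A minimal sketch, reconstructing the result above by hand so it is self-contained (the joined names like A_agg_sum_a_b are an assumption, not something from the answer):

```python
import pandas as pd

# Rebuild the df_agg shown above, with its two-level columns.
df_agg = pd.DataFrame(
    [[6, 2.0, 3, 0.5, 5], [14, 2.0, 7, 0.5, 25]],
    index=pd.Index([1, 2], name='id'),
    columns=pd.MultiIndex.from_arrays(
        [['A'] * 3 + ['D'] + ['E'],
         ['agg_sum_a_b', 'agg_sum_a_b_div_c', 'sum',
          'agg_sum_d_div_sum_e_f', 'e_mult_f']]))

# Join the two levels into flat, single-string column names.
df_agg.columns = ['_'.join(pair) for pair in df_agg.columns]
```

After this, columns can be accessed directly, e.g. df_agg['A_agg_sum_a_b'], instead of indexing with tuples.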

1 Comment

df_g_sum = df_g[List of columns you care about].sum() could be of use to avoid summing over irrelevant columns. (There are no such columns in the example, but a 'G' column over which no aggregation was needed would have an unnecessary sum run on it.)
