
I have the following table

event_name | score | date      | flag
event_1    | 123   | 12APR2018 | 0
event_1    | 34    | 05JUN2019 | 0
event_1    | 198   | 08APR2020 | 0
event_2    | 3     | 14SEP2019 | 0
event_2    | 34    | 22DEC2019 | 1
event_2    | 90    | 17FEB2020 | 0
event_3    | 772   | 19MAR2021 | 1

And I want to obtain

event_name | sum_score | date_flag_1
event_1    | 355       |
event_2    | 127       | 22DEC2019
event_3    | 772       | 19MAR2021

where sum_score is the sum of the score column for the corresponding event, and date_flag_1 is the first date on which flag = 1 for that event. If flag = 0 for all rows of an event, date_flag_1 should be missing.
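For reference, a minimal reproducible setup for this table (assuming plain pandas; the dates are kept as strings, as in the question):

```python
import pandas as pd

# Sample data from the question; dates are plain strings here
df = pd.DataFrame({
    'event_name': ['event_1', 'event_1', 'event_1',
                   'event_2', 'event_2', 'event_2', 'event_3'],
    'score': [123, 34, 198, 3, 34, 90, 772],
    'date': ['12APR2018', '05JUN2019', '08APR2020',
             '14SEP2019', '22DEC2019', '17FEB2020', '19MAR2021'],
    'flag': [0, 0, 0, 0, 1, 0, 1],
})
```

If real datetimes are needed, `pd.to_datetime(df['date'], format='%d%b%Y')` should parse this layout.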

I suppose that the code should look something like

df_agg = df.groupby('event_name').agg({'score': 'sum', ['date', 'flag']: my_custom_function})
df_agg.columns = ['event_name', 'sum_score', 'date_flag_1']

However, I am not sure how to implement my_custom_function, which would be a custom aggregation function that uses two columns instead of one (unlike the other aggregation functions). Please help.

1 Answer
Aggregate twice and concat the results. For the second aggregation, subset the rows where flag equals 1, then use the built-in GroupBy.first.

import pandas as pd

pd.concat([df.groupby('event_name')['score'].sum(),
           df[df.flag.eq(1)].groupby('event_name')['date'].first().rename('date_flag_1')], 
          axis=1)

#            score date_flag_1
#event_name                   
#event_1       355         NaN
#event_2       127   22DEC2019
#event_3       772   19MAR2021
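If the goal is the exact output table from the question (event_name back as a column and the sum renamed to sum_score), a reset_index on the concatenated result gets there. A sketch building on the answer above, with the sample data inlined:

```python
import pandas as pd

df = pd.DataFrame({
    'event_name': ['event_1', 'event_1', 'event_1',
                   'event_2', 'event_2', 'event_2', 'event_3'],
    'score': [123, 34, 198, 3, 34, 90, 772],
    'date': ['12APR2018', '05JUN2019', '08APR2020',
             '14SEP2019', '22DEC2019', '17FEB2020', '19MAR2021'],
    'flag': [0, 0, 0, 0, 1, 0, 1],
})

# Same two aggregations as above, renamed to match the desired headers,
# then the index is turned back into a regular column
out = pd.concat(
    [df.groupby('event_name')['score'].sum().rename('sum_score'),
     df[df.flag.eq(1)].groupby('event_name')['date'].first().rename('date_flag_1')],
    axis=1,
).reset_index()
```

Events with no flag = 1 row (here event_1) come out as NaN in date_flag_1, matching the "missing" requirement.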

For illustration, this can also be done with a single agg call; however, it will be very slow, because it requires a lambda x: that is calculated as a slow Python loop over the groups (as opposed to the vectorized/cythonized built-in GroupBy operations).

Because .agg only acts on a single Series, the hacky workaround is to create a function that accepts both the Series and the DataFrame. You use the Series index to subset the DataFrame (the index must be free of duplicates for this to work properly), which then lets you do aggregations that use multiple columns. This is both overly complicated and slow, so I wouldn't do it.

import numpy as np

def get_first_date(s, df):
    # rows within the group where `s == 1`
    res = df.loc[s[s.eq(1)].index, 'date'].dropna()

    if not res.empty:
        return res.iloc[0]
    else:
        return np.nan

df.groupby('event_name').agg({'score': 'sum', 
                              'flag': lambda x: get_first_date(x, df)})

#            score       flag
#event_name                  
#event_1       355        NaN
#event_2       127  22DEC2019
#event_3       772  19MAR2021

5 Comments

Elegant solution!
Thank you very much! Is it also possible to do this process with a single aggregating function (without concatenation)?
@pentavol It can be done, but it's very hacky and complicated. agg really only works column by column, so you'd want to use .apply, but .apply doesn't play well with multiple aggregations. The other issue is that .apply(my_func) will be slow so even though the above might look bad because of two separate groupby calls, it will be fast because the built-in methods are optimized
@ALollz - thanks for the explanations. I didn't even know apply was slow, I only knew iterrows is very slow. Also, I will accept your answer in 4 minutes, currently I am not allowed to do it
@pentavol well not all apply calls are equal. I think stuff like df.groupby('event_name')['score'].apply(sum) will default to the very fast built-in implementation of GroupBy.sum. The real problem is df.apply(lambda x: ... or df.apply(my_custom_func), because those are calculated as slow loops over the groups. Might not be bad if you have a few hundred groups, but with very big data it can be a big burden
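To illustrate the point in the comments: the built-in reduction and a lambda produce identical results; only the execution path differs, with the lambda run as a Python-level loop over the groups. A minimal sketch with the question's scores:

```python
import pandas as pd

df = pd.DataFrame({'event_name': ['event_1'] * 3 + ['event_2'] * 3 + ['event_3'],
                   'score': [123, 34, 198, 3, 34, 90, 772]})

fast = df.groupby('event_name')['score'].sum()                    # cythonized built-in
slow = df.groupby('event_name')['score'].agg(lambda s: s.sum())   # Python loop per group

assert fast.equals(slow)  # same values either way; only the speed differs
```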
