
I have the following table

event_name | score | date      | flag
event_1    | 123   | 12APR2018 | 0
event_1    | 34    | 05JUN2019 | 0
event_1    | 198   | 08APR2020 | 0
event_2    | 3     | 14SEP2019 | 0
event_2    | 34    | 22DEC2019 | 1
event_2    | 90    | 17FEB2020 | 0
event_3    | 772   | 19MAR2021 | 1

And I want to obtain

event_name | sum_score | date_flag_1
event_1    | 355       |
event_2    | 127       | 22DEC2019
event_3    | 772       | 19MAR2021

where sum_score is the sum of the score column for the corresponding event, and date_flag_1 is the first date on which flag = 1 for that event. If flag = 0 for all rows of an event, date_flag_1 should be missing.
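For reference, a minimal reproducible setup for this table (assuming plain pandas; the dates are kept as strings, as in the question):

```python
import pandas as pd

# Sample data from the question; dates are plain strings here
df = pd.DataFrame({
    'event_name': ['event_1', 'event_1', 'event_1',
                   'event_2', 'event_2', 'event_2', 'event_3'],
    'score': [123, 34, 198, 3, 34, 90, 772],
    'date': ['12APR2018', '05JUN2019', '08APR2020',
             '14SEP2019', '22DEC2019', '17FEB2020', '19MAR2021'],
    'flag': [0, 0, 0, 0, 1, 0, 1],
})
```

If real datetimes are needed, `pd.to_datetime(df['date'], format='%d%b%Y')` should parse this layout.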

I suppose that the code should look something like

df_agg = df.groupby('event_name').agg({'score': 'sum', ['date', 'flag']: my_custom_function})
df_agg.columns = ['event_name', 'sum_score', 'date_flag_1']

However, I am not sure how to implement my_custom_function, which would be a custom aggregation function that uses two columns instead of one (unlike the other aggregation functions). Please help.

1 Answer
Aggregate twice and concat the results. For the second aggregation, subset the rows where flag equals 1, then use the built-in GroupBy.first.

import pandas as pd

pd.concat([df.groupby('event_name')['score'].sum(),
           df[df.flag.eq(1)].groupby('event_name')['date'].first().rename('date_flag_1')], 
          axis=1)

#            score date_flag_1
#event_name                   
#event_1       355         NaN
#event_2       127   22DEC2019
#event_3       772   19MAR2021
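If the goal is the exact output table from the question (event_name back as a column and the sum renamed to sum_score), a reset_index on the concatenated result gets there. A sketch building on the answer above, with the sample data inlined:

```python
import pandas as pd

df = pd.DataFrame({
    'event_name': ['event_1', 'event_1', 'event_1',
                   'event_2', 'event_2', 'event_2', 'event_3'],
    'score': [123, 34, 198, 3, 34, 90, 772],
    'date': ['12APR2018', '05JUN2019', '08APR2020',
             '14SEP2019', '22DEC2019', '17FEB2020', '19MAR2021'],
    'flag': [0, 0, 0, 0, 1, 0, 1],
})

# Same two aggregations as above, renamed to match the desired headers,
# then the index is turned back into a regular column
out = pd.concat(
    [df.groupby('event_name')['score'].sum().rename('sum_score'),
     df[df.flag.eq(1)].groupby('event_name')['date'].first().rename('date_flag_1')],
    axis=1,
).reset_index()
```

Events with no flag = 1 row (here event_1) come out as NaN in date_flag_1, matching the "missing" requirement.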

For illustration, this can also be done with a single agg call; however, it will be very slow, because it requires a lambda x: that is calculated as a slow Python loop over the groups (as opposed to the vectorized/cythonized built-in GroupBy operations).

Because .agg only acts on a single Series, the hacky workaround is to create a function that accepts both the Series and the DataFrame. You use the Series index to subset the DataFrame (the index must be free of duplicates for this to work properly), which then lets you do aggregations that use multiple columns. This is both overly complicated and slow, so I wouldn't do it.

import numpy as np

def get_first_date(s, df):
    # rows within the group where `s == 1`
    res = df.loc[s[s.eq(1)].index, 'date'].dropna()

    if not res.empty:
        return res.iloc[0]
    else:
        return np.nan

df.groupby('event_name').agg({'score': 'sum', 
                              'flag': lambda x: get_first_date(x, df)})

#            score       flag
#event_name                  
#event_1       355        NaN
#event_2       127  22DEC2019
#event_3       772  19MAR2021

5 Comments

Elegant solution!
Thank you very much! Is it also possible to do this process with a single aggregating function (without concatenation)?
@pentavol It can be done, but it's very hacky and complicated. agg really only works column by column, so you'd want to use .apply, but .apply doesn't play well with multiple aggregations. The other issue is that .apply(my_func) will be slow so even though the above might look bad because of two separate groupby calls, it will be fast because the built-in methods are optimized
@ALollz - thanks for the explanations. I didn't even know apply was slow, I only knew iterrows is very slow. Also, I will accept your answer in 4 minutes, currently I am not allowed to do it
@pentavol well not all apply calls are equal. I think stuff like df.groupby('event_name')['score'].apply(sum) will default to the very fast built-in implementation of GroupBy.sum. The real problem is df.apply(lambda x: ... or df.apply(my_custom_func), because those are calculated as slow loops over the groups. Might not be bad if you have a few hundred groups, but with very big data it can be a big burden
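To illustrate the point in the comments: the built-in reduction and a lambda produce identical results; only the execution path differs, with the lambda run as a Python-level loop over the groups. A minimal sketch with the question's scores:

```python
import pandas as pd

df = pd.DataFrame({'event_name': ['event_1'] * 3 + ['event_2'] * 3 + ['event_3'],
                   'score': [123, 34, 198, 3, 34, 90, 772]})

fast = df.groupby('event_name')['score'].sum()                    # cythonized built-in
slow = df.groupby('event_name')['score'].agg(lambda s: s.sum())   # Python loop per group

assert fast.equals(slow)  # same values either way; only the speed differs
```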
