Aggregate multiple columns in a dataframe based on custom functions

Question

Afternoon All,

I have been trying to resolve this for a while; any help would be appreciated.

Here is my dataframe:

Channel state       rfq_qty
A        Done       10
B        Tied Done  10
C        Done       10
C        Done       10
C        Done       10
C        Tied Done  10
B        Done       10
B        Done       10

I would like to:

Group by channel, then state

Sum the rfq_qty for each channel

Count the occurrences of each 'done' string in state ('Done' is treated the same as 'Tied Done', i.e. anything with 'done' in it)

Display each channel's rfq_qty as a percentage of the total rfq_qty (80)

Channel state   rfq_qty Percentage
A         1       10    0.125
B         3       30    0.375
C         4       40    0.5

I have attempted this with the following:

df_Done = df[
                (
                    df['state']=='Done'
                ) 
                | 
                (
                    df['state'] == 'Tied Done'
                )
            ][['Channel','state','rfq_qty']]

df_Done['Percentage_Qty']= df_Done['rfq_qty']/df_Done['rfq_qty'].sum()
df_Done['Done_Trades']= df_Done['state'].count()

display(
        df_Done[
                (df_Done['Channel'] != 0)
               ].groupby(['Channel'])['Channel','Count of Done','rfq_qty','Percentage_Qty'].sum().sort_values(['rfq_qty'], ascending=False)
       )

Works but looks convoluted. Any improvements?

jezrael · Accepted Answer · 2018-03-14 10:01:43Z


I think you can use:

first filter with isin and loc
then groupby and aggregate with agg, passing tuples of new column names and functions
add Percentage by dividing rfq_qty by its sum with div
finally, if necessary, sort with sort_values by rfq_qty

df_Done = df.loc[df['state'].isin(['Done', 'Tied Done']), ['Channel','state','rfq_qty']]

# to keep every state that contains 'Done' instead:
#df_Done = df[df['state'].str.contains('Done')]

# if necessary, also filter out Channel == 0:
#mask = (df['Channel'] != 0) & df['state'].isin(['Done', 'Tied Done'])
#df_Done = df.loc[mask, ['Channel','state','rfq_qty']]

# list of (new column name, function) tuples; a list (not a set) keeps the column order fixed
d = [('Done_Trades', 'size'), ('rfq_qty', 'sum')]
df = df_Done.groupby('Channel')['rfq_qty'].agg(d).reset_index()
df['Percentage'] = df['rfq_qty'].div(df['rfq_qty'].sum())
df = df.sort_values('rfq_qty')
print(df)
  Channel  Done_Trades  rfq_qty  Percentage
0       A            1       10       0.125
1       B            3       30       0.375
2       C            4       40       0.500
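
On pandas 0.25 or later, the same filter-then-aggregate steps can probably also be written with named aggregation, which names each output column explicitly. This is only a sketch under that version assumption; it rebuilds the sample data from the question, and the names summarise_done and out are illustrative, not part of the original answer:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'Channel': ['A', 'B', 'C', 'C', 'C', 'C', 'B', 'B'],
    'state': ['Done', 'Tied Done', 'Done', 'Done', 'Done', 'Tied Done', 'Done', 'Done'],
    'rfq_qty': [10] * 8,
})

def summarise_done(frame):
    # keep only rows whose state is one of the two "done" variants
    done = frame.loc[frame['state'].isin(['Done', 'Tied Done'])]
    # named aggregation: new_column=(source_column, function)
    out = (done.groupby('Channel')
               .agg(Done_Trades=('state', 'size'), rfq_qty=('rfq_qty', 'sum'))
               .reset_index())
    # share of each channel in the overall rfq_qty
    out['Percentage'] = out['rfq_qty'].div(out['rfq_qty'].sum())
    # largest channel first, with a fresh 0..n-1 index
    return out.sort_values('rfq_qty', ascending=False).reset_index(drop=True)

out = summarise_done(df)
print(out)
```

Here the sort is descending, as in the question's own attempt; drop ascending=False to get the ascending order printed above.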

edited Mar 14, 2018 at 10:01

answered Mar 14, 2018 at 9:28


10 Comments

Peter Lucas Over a year ago

Hey Jezrael. Thanks for that. When I try to sort on the sum column, it fails to sort from largest to smallest: df.sort_values(['sum'], ascending=False)

jezrael Over a year ago

@PeterLucas - just remove , ascending=False

Peter Lucas Over a year ago

Perfect, case issue on column header. Thanks again!

jezrael Over a year ago

@jpp - Hmmm, in my opinion, if the OP filters first and then works with the filtered df_Done DataFrame, there is no problem.

jezrael Over a year ago

@jpp - I agree, so I added the comment df_Done = df[df['state'].str.contains('Done')]


jpp · Accepted Answer · 2018-03-14 09:39:41Z

One way is to use a single df.groupby.agg and rename columns:

import pandas as pd

df = pd.DataFrame({'Channel': ['A', 'B', 'C', 'C', 'C', 'C', 'B', 'B'],
                   'state': ['Done', 'Tied Done', 'Done', 'Done', 'Done', 'Tied Done', 'Done', 'Done'],
                   'rfq_qty': [10, 10, 10, 10, 10, 10, 10, 10]})

agg_funcs = {'state': lambda x: x[x.str.contains('Done')].count(),
             'rfq_qty': ['sum', lambda x: x.sum() / df['rfq_qty'].sum()]}

res = df.groupby('Channel').agg(agg_funcs).reset_index()
res.columns = ['Channel', 'state', 'rfq_qty', 'Percentage']

#   Channel  state  rfq_qty  Percentage
# 0       A      1       10       0.125
# 1       B      3       30       0.375
# 2       C      4       40       0.500

This isn't the most efficient way, since it relies on non-vectorised aggregations (the lambdas run once per group), but it may be a good option if it is performant enough for your use case.
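
If the lambdas turn out to be too slow, one possible vectorised variant is to flag the 'done' rows once, up front, and then use only built-in aggregations. This is a rough sketch, assuming pandas 0.25+ for named aggregation; is_done, flagged and res_vec are made-up names, and case=False is my reading of "anything with 'done' in it":

```python
import pandas as pd

# same sample data as above
df = pd.DataFrame({'Channel': ['A', 'B', 'C', 'C', 'C', 'C', 'B', 'B'],
                   'state': ['Done', 'Tied Done', 'Done', 'Done', 'Done', 'Tied Done', 'Done', 'Done'],
                   'rfq_qty': [10, 10, 10, 10, 10, 10, 10, 10]})

# 1 where state mentions "done" (any case), 0 otherwise; missing states count as 0
flagged = df.assign(
    is_done=df['state'].str.contains('done', case=False, na=False).astype(int))

# built-in 'sum' on both columns, so no Python-level function runs per group
res_vec = (flagged.groupby('Channel')
                  .agg(state=('is_done', 'sum'), rfq_qty=('rfq_qty', 'sum'))
                  .reset_index())

# percentage of the overall total, computed once on the aggregated frame
res_vec['Percentage'] = res_vec['rfq_qty'] / res_vec['rfq_qty'].sum()
print(res_vec)
```

As in the lambda version, rfq_qty is summed over all rows of a channel, not only the 'done' ones; filter first if that is not what you want.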
