I have a DataFrame in the following form:
+---------+---------+-------+-------+-----------------+
| country | payment | type | err | email |
+---------+---------+-------+-------+-----------------+
| AU | visa | type1 | OK | [email protected] |
| DE | paypal | type1 | OK | [email protected] |
| AU | visa | type2 | ERROR | [email protected] |
| US | visa | type2 | OK | [email protected] |
| FR | visa | type1 | OK | [email protected] |
| FR | visa | type1 | ERROR | [email protected] |
+---------+---------+-------+-------+-----------------+
import pandas as pd
df = pd.DataFrame({'country': ['AU','DE','AU','US','FR','FR'],
                   'payment': ['visa','paypal','visa','visa','visa','visa'],
                   'type': ['type1','type1','type2','type2','type1','type1'],
                   'err': ['OK','OK','ERROR','OK','OK','ERROR'],
                   'email': ['[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]']})
My goal is to group it by payment and country and create the following new columns:
number_payments - the row count for each group,
num_errors - the number of ERROR values in the group,
num_type1 .. num_type3 - the number of rows with the corresponding value in column type (only 3 values are possible),
num_errors_per_unique_email - the average number of errors per unique email in the group,
num_type1_per_unique_email .. num_type3_per_unique_email - the average number of rows of each type per unique email in the group.
Like this:
+---------+---------+-----------------+------------+-----------+-----------+-----------------------------+----------------------------+----------------------------+----------------------------+
| payment | country | number_payments | num_errors | num_type1 | num_type2 | num_errors_per_unique_email | num_type1_per_unique_email | num_type2_per_unique_email | num_type3_per_unique_email |
+---------+---------+-----------------+------------+-----------+-----------+-----------------------------+----------------------------+----------------------------+----------------------------+
| paypal | DE | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| visa | AU | 2 | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| visa | FR | 2 | 0 | 1 | 1 | 1 | 2 | 0 | 0 |
| visa | US | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
+---------+---------+-----------------+------------+-----------+-----------+-----------------------------+----------------------------+----------------------------+----------------------------+
Thanks to @anky's solution (get dummies, create the groups, join the size with the sum), I can do the first part of the task with the code below, which produces the following table:
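# Boolean flag: True where the row is an error.
c = df['err'].eq("ERROR")
# Attach the error flag and one dummy column per type, then group.
g = (df[['payment','country']].assign(num_errors=c,
     **pd.get_dummies(df[['type']],prefix=['num'])).groupby(['payment','country']))
# The group size gives number_payments; summing the flags gives the counts.
out = g.size().to_frame("number_payments").join(g.sum()).reset_index()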
+---------+---------+-----------------+------------+-----------+-----------+
| payment | country | number_payments | num_errors | num_type1 | num_type2 |
+---------+---------+-----------------+------------+-----------+-----------+
| paypal | DE | 1 | 0 | 1 | 0 |
| visa | AU | 2 | 1 | 1 | 1 |
| visa | FR | 2 | 1 | 2 | 0 |
| visa | US | 1 | 0 | 0 | 1 |
+---------+---------+-----------------+------------+-----------+-----------+
But I'm stuck on how to properly add columns like 'num_errors_per_unique_email' and 'num_type_per_unique_email'.
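To make the target concrete, here is a minimal sketch of one possible reading, assuming "per unique email" means the group's count divided by the number of distinct emails in that group. That definition is an assumption, the email values are made-up placeholders (not the real addresses), and the names `keys`, `type_cols`, `flags`, `n_emails` and `per_email` are only illustrative.

```python
import pandas as pd

# Same shape as the sample data; the email values are made-up placeholders.
df = pd.DataFrame({
    'country': ['AU', 'DE', 'AU', 'US', 'FR', 'FR'],
    'payment': ['visa', 'paypal', 'visa', 'visa', 'visa', 'visa'],
    'type': ['type1', 'type1', 'type2', 'type2', 'type1', 'type1'],
    'err': ['OK', 'OK', 'ERROR', 'OK', 'OK', 'ERROR'],
    'email': ['a@example.com', 'b@example.com', 'a@example.com',
              'c@example.com', 'd@example.com', 'd@example.com'],
})

keys = ['payment', 'country']

# One indicator column per type; reindex so num_type3 exists even when
# no row in the data has that type.
type_cols = ['num_type1', 'num_type2', 'num_type3']
dummies = (pd.get_dummies(df['type'], prefix='num')
             .reindex(columns=type_cols, fill_value=0)
             .astype(int))

# Error flag plus type indicators, alongside the grouping keys.
num_errors = df['err'].eq('ERROR').astype(int).rename('num_errors')
flags = pd.concat([df[keys], num_errors, dummies], axis=1)

# Group size gives number_payments; summing the flags gives the counts.
g = flags.groupby(keys)
counts = g.size().to_frame('number_payments').join(g.sum())

# Number of distinct emails in each (payment, country) group.
n_emails = df.groupby(keys)['email'].nunique()

# Assumed definition: count divided by the number of distinct emails.
per_email = (counts.drop(columns='number_payments')
                   .div(n_emails, axis=0)
                   .add_suffix('_per_unique_email'))

out = counts.join(per_email).reset_index()
print(out)
```

If the average should instead be taken over per-email counts in some other way (for example, only over emails that had at least one error), the denominator `n_emails` is the part that would need to change.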
I'd appreciate any help.