Grouping by a column and assigning different aggregation function for a certain column that is used in a group by

Question

I have a df:

field           category            2022-01-10      2022-01-17      2022-01-24    2022-01-31 
A               Chair               50              100             200           150
B               Chair               75              15              57            75   
C               Chair               0.57            0.64            0.4           0.47
A               Table               50              100             200           150
B               Table               75              15              57            75   
C               Table               0.57            0.64            0.4           0.47
A               Mirror              50              100             200           150
B               Mirror              75              15              57            75   
C               Mirror              0.57            0.64            0.4           0.47

I am trying to group by field or simply drop the category column and aggregate the values for each week so that the result would look like this:

field       2022-01-10      2022-01-17      2022-01-24    2022-01-31 
A           50+50+50        100+100+100     200+200+200   150+150+150
B           75+75+75        15+15+15        57+57+57      75+75+75
C           np.mean([0.57, 0.57, 0.57])     # each weeks average of field C

For fields A & B each week I would sum up the values for that week in different categories, but for field C I want the average.

I tried: df.groupby(['field']).agg({'A':sum, 'B':sum, 'C'.np.mean}) but this of course does not work because the fields arethe thing that's being grouped by and not the column.

I believe this problem can be sorted by somehow dropping an category column and then transforming the values based on field column but I am not sure how that works.

Corralien · Accepted Answer · 2022-02-23 13:40:59Z

1

Create a dict of functions to apply then flat your dataframe and group by field and variable (dates) columns. Use the first index of groupby to get the right function. Finally reshape your dataframe.

aggfuncs = {'A': lambda x: np.sum(x),
            'B': lambda x: np.sum(x),
            'C': lambda x: np.mean(x)}

out = df.melt(['field', 'category']).groupby(['field', 'variable'])['value'] \
        .apply(lambda x: aggfuncs[x.name[0]](x)).unstack() \
        .rename_axis(columns=None).reset_index()
print(out)

# Output
  field  2022-01-10  2022-01-17  2022-01-24  2022-01-31
0     A      150.00      300.00       600.0      450.00
1     B      225.00       45.00       171.0      225.00
2     C        0.57        0.64         0.4        0.47

answered Feb 23, 2022 at 13:40

Corralien

121k8 gold badges44 silver badges69 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Raymond Kwok · Accepted Answer · 2022-02-23 13:41:11Z

1

{A, B} and {C} require different handling, so we can handle them separately

pd.concat([
    df[df['field'].isin(['A', 'B'])).groupby('field').sum(), #for AB
    df[df['field']=='C'].groupby('field').mean(),  #for C
])

edited Feb 23, 2022 at 13:41

answered Feb 23, 2022 at 13:28

Raymond Kwok

2,5612 gold badges11 silver badges11 bronze badges

4 Comments

Jonas Palačionis Over a year ago

This creates rows for each date column for some reason.

Jonas Palačionis Over a year ago

I ended up using df[df['field']=='C'].groupby('field').mean() instead in the second part.

Raymond Kwok Over a year ago

My bad. Please try again

Raymond Kwok Over a year ago

Yes. Agree with you.

Shubham Sharma · Accepted Answer · 2022-02-23 13:51:44Z

1

Here is one way to approach the problem:

s = df.set_index('field').select_dtypes(np.number)

out = s.drop('C').groupby(level=0).sum()
out.loc['C', :] = s.loc['C'].mean()

Result:

print(out)

       2022-01-10  2022-01-17  2022-01-24  2022-01-31
field                                                
A          150.00      300.00       600.0      450.00
B          225.00       45.00       171.0      225.00
C            0.57        0.64         0.4        0.47

answered Feb 23, 2022 at 13:51

Shubham Sharma

71.8k6 gold badges26 silver badges58 bronze badges

2 Comments

Jonas Palačionis Over a year ago

Does the .select_dtypes(np.number) takes everything but the floats like 0.57? Asking because this solution is clear but might be limited to what the df is.

Shubham Sharma Over a year ago

@JonasPalačionis select_dtypes(np.number) will select the numeric only columns thus it will automatically drop the category column.

Collectives™ on Stack Overflow

Grouping by a column and assigning different aggregation function for a certain column that is used in a group by

3 Answers 3

Comments

4 Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

4 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related