Group by with aggregation function as new field in pandas

Question

If I run the following GROUP BY query on a MySQL table

SELECT col1, count(col2) * count(distinct(col3)) as agg_col
FROM my_table
GROUP BY col1

what I get is a table with three columns

col1 col2 agg_col

How can I do the same on a pandas dataframe?

Suppose I have a DataFrame with three columns: col1, col2 and col3. The group-by operation

grouped = my_df.groupby('col1')

will return the data grouped by col1.

Also

agg_col_series = grouped.col2.size() * grouped.col3.nunique()

will return the aggregated column equivalent to the one in the SQL query. But how can I add it to the grouped DataFrame?

Might this help you manipulating groupby objects? stackoverflow.com/questions/10373660/…
– Tkanno, Commented Jul 1, 2017 at 12:55
Are you sure your SQL produces three columns? IMO col2 is missing in the SELECT and in the GROUP BY clauses...
– MaxU - stand with Ukraine, Commented Jul 1, 2017 at 14:00
Can you provide a small reproducible data set and desired data set?
– MaxU - stand with Ukraine, Commented Jul 1, 2017 at 14:01
Agreed with @MaxU, your SQL should only output 2 columns as you multiply two aggregates in SELECT.
– Parfait, Commented Jul 1, 2017 at 17:06
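
The commenters' point about the column count can be checked with a small sketch. It assumes an in-memory SQLite database as a stand-in for the MySQL table, and the sample rows are made up for illustration; the query itself is the one from the question.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL table (an assumption);
# the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (col1 INTEGER, col2 INTEGER, col3 INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [(1, 10, 100), (1, 11, 100), (1, 12, 101), (2, 20, 200)],
)

cursor = conn.execute(
    "SELECT col1, count(col2) * count(DISTINCT col3) AS agg_col "
    "FROM my_table GROUP BY col1 ORDER BY col1"
)
# cursor.description holds one entry per result column; the name comes first
column_names = [d[0] for d in cursor.description]
rows = cursor.fetchall()
conn.close()

print(column_names)  # ['col1', 'agg_col']
print(rows)          # [(1, 6), (2, 1)]
```

The result has two columns, col1 and agg_col, because the two aggregates are multiplied into a single expression.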

NickBraunagel · Accepted Answer · 2017-07-01 15:07:52Z

1

We'd need to see your data to be sure, but I think you need to simply reset the index of your agg_col_series:

agg_col_series.reset_index(name='agg_col')

Full example with dummy data:

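import random
import pandas as pd

# Dummy data: 999 rows of random integers (no seed, so values differ between runs)
col1 = [random.randint(1, 5) for _ in range(1, 1000)]
col2 = [random.randint(1, 100) for _ in range(1, 1000)]
col3 = [random.randint(1, 100) for _ in range(1, 1000)]

df = pd.DataFrame(data={
        'col1': col1,
        'col2': col2,
        'col3': col3,
    })

grouped = df.groupby('col1')
# Row count per group times the number of distinct col3 values per group
agg_col_series = grouped.col2.size() * grouped.col3.nunique()

# reset_index turns the col1 index into a column and names the values 'agg_col'
print(agg_col_series.reset_index(name='agg_col'))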

   col1  agg_col
0     1    15566
1     2    20056
2     3    17313
3     4    17304
4     5    16380
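
One detail worth checking against the SQL: count(col2) skips NULLs, whereas size() counts every row in the group. If col2 can contain missing values, count() is probably the closer pandas equivalent. A minimal sketch with made-up data (the frame below is hypothetical, not the asker's):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in col2
df = pd.DataFrame({
    "col1": [1, 1, 1, 2, 2],
    "col2": [10.0, np.nan, 12.0, 20.0, 21.0],
    "col3": ["a", "a", "b", "c", "c"],
})
grouped = df.groupby("col1")

# size() counts all rows per group, including the row where col2 is NaN
with_size = (grouped.col2.size() * grouped.col3.nunique()).reset_index(name="agg_col")

# count() counts only non-null col2 values, like SQL count(col2)
with_count = (grouped.col2.count() * grouped.col3.nunique()).reset_index(name="agg_col")

print(with_size)   # agg_col is 6 for group 1 and 2 for group 2
print(with_count)  # agg_col is 4 for group 1 and 2 for group 2
```

When col2 has no missing values, both variants give the same result.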

answered Jul 1, 2017 at 15:07


1 Comment

Apostolos Over a year ago

Thank you Nick!!

Scott Boston · Answer · 2017-07-01 15:02:52Z

1

Let's use groupby with a lambda function that multiplies size by nunique, then rename the resulting series to 'agg_col' and call reset_index to get a DataFrame.

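import pandas as pd
import numpy as np

# Seeded dummy data so the result is reproducible
np.random.seed(443)
df = pd.DataFrame({'Col1':np.random.choice(['A','B','C'],50),
                   'Col2':np.random.randint(1000,9999,50),
                   'Col3':np.random.choice(['A','B','C','D','E','F','G','H','I','J'],50)})

# Per group: row count times the number of distinct Col3 values,
# then name the series and move the Col1 index into a column
df_out = df.groupby('Col1').apply(lambda x: x.Col2.size * x.Col3.nunique()).rename('agg_col').reset_index()
print(df_out)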

Output:

  Col1  agg_col
0    A      120
1    B       96
2    C      190
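
The same kind of result can also be computed without apply, which may be faster on large frames. The sketch below is an alternative, not the method used in this answer: it assumes pandas 0.25 or later for named aggregation, uses a small made-up frame, and the intermediate column names n_rows and n_unique are hypothetical.

```python
import pandas as pd

# Small made-up frame with the same column names as the example above
df = pd.DataFrame({
    "Col1": ["A", "A", "B", "B", "B"],
    "Col2": [1, 2, 3, 4, 5],
    "Col3": ["x", "y", "x", "x", "x"],
})

# Named aggregation: one output column per (source column, function) pair.
# n_rows and n_unique are hypothetical names chosen for this sketch.
stats = df.groupby("Col1").agg(
    n_rows=("Col2", "size"),
    n_unique=("Col3", "nunique"),
)

# Multiply the two aggregates, keep only the product, and restore Col1 as a column
df_out = (
    stats.assign(agg_col=stats["n_rows"] * stats["n_unique"])[["agg_col"]]
    .reset_index()
)
print(df_out)  # A -> 2 * 2 = 4, B -> 3 * 1 = 3
```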

answered Jul 1, 2017 at 15:02


2 Comments

Apostolos Over a year ago

Thanks to both of you. The only reason I am choosing @nick-braunagel's answer is that he came first and has the lower reputation, so accepting it helps increase it :)

Scott Boston Over a year ago

@apostolos. Great! Glad it worked for you. Thanks for the upvote.
