
If I run the following GROUP BY on a MySQL table

SELECT col1, count(col2) * count(distinct(col3)) as agg_col
FROM my_table
GROUP BY col1

what I get is a table with three columns

col1 col2 agg_col

How can I do the same on a pandas dataframe?

Suppose I have a DataFrame with three columns: col1, col2, and col3. The group-by operation

grouped = my_df.groupby('col1')

will return the data grouped by col1.

Also

agg_col_series = grouped.col2.size() * grouped.col3.nunique()

will return the aggregated column equivalent to the one in the SQL query. But how can I add this to the grouped DataFrame?

  • Might this help you manipulate groupby objects? stackoverflow.com/questions/10373660/… Commented Jul 1, 2017 at 12:55
  • Also here stackoverflow.com/questions/29082412/… Commented Jul 1, 2017 at 13:02
  • Are you sure your SQL produces three columns? IMO col2 is missing in the SELECT and in the GROUP BY clauses... Commented Jul 1, 2017 at 14:00
  • Can you provide a small reproducible data set and desired data set? Commented Jul 1, 2017 at 14:01
  • Agreed with @MaxU, your SQL should only output 2 columns as you multiply two aggregates in SELECT. Commented Jul 1, 2017 at 17:06

2 Answers


We'd need to see your data to be sure, but I think you simply need to reset the index of your agg_col_series:

agg_col_series.reset_index(name='agg_col')

Full example with dummy data:

import random
import pandas as pd

col1 = [random.randint(1,5) for x in range(1,1000)]
col2 = [random.randint(1,100) for x in range(1,1000)]
col3 = [random.randint(1,100) for x in range(1,1000)]

df = pd.DataFrame(data={
        'col1': col1,
        'col2': col2,
        'col3': col3,
    })

grouped = df.groupby('col1')
agg_col_series = grouped.col2.size() * grouped.col3.nunique()

# print() call form works on Python 3 (and on Python 2 for a single argument)
print(agg_col_series.reset_index(name='agg_col'))

index   col1  agg_col
0       1    15566
1       2    20056
2       3    17313
3       4    17304
4       5    16380
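
One caveat on the equivalence with the SQL: COUNT(col2) skips NULLs, while size() counts every row in the group, so the two can differ when col2 has missing values. Below is a minimal sketch on a tiny made-up frame (the data is illustrative, not taken from the question) that compares size() with count(), which counts only non-null values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one col2 value is missing on purpose.
df = pd.DataFrame({
    'col1': ['a', 'a', 'a', 'b', 'b'],
    'col2': [1.0, np.nan, 3.0, 4.0, 5.0],
    'col3': ['x', 'y', 'x', 'z', 'z'],
})

grouped = df.groupby('col1')

# size() counts all rows per group, including the row where col2 is NaN.
with_size = (grouped.col2.size() * grouped.col3.nunique()).reset_index(name='agg_col')

# count() skips NaN, which matches how SQL COUNT(col2) treats NULL.
with_count = (grouped.col2.count() * grouped.col3.nunique()).reset_index(name='agg_col')

print(with_size)
print(with_count)
```

If col2 never contains missing values, both variants give the same result, so the choice only matters for nullable columns.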

1 Comment

Thank you Nick!!

Let's use groupby with a lambda function that multiplies size by nunique, then rename the resulting series to 'agg_col' and call reset_index to get a DataFrame.

import pandas as pd
import numpy as np

np.random.seed(443)
df = pd.DataFrame({'Col1':np.random.choice(['A','B','C'],50),
                   'Col2':np.random.randint(1000,9999,50),
                   'Col3':np.random.choice(['A','B','C','D','E','F','G','H','I','J'],50)})

df_out = df.groupby('Col1').apply(lambda x: x.Col2.size * x.Col3.nunique()).rename('agg_col').reset_index()

Output:

  Col1  agg_col
0    A      120
1    B       96
2    C      190
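
As a possible alternative to apply with a lambda, the same result can be sketched with named aggregation followed by a column product. This assumes pandas 0.25 or later, where named aggregation is available; the toy data and the intermediate names n_rows and n_distinct are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data, small enough to check by hand.
df = pd.DataFrame({
    'Col1': ['A', 'A', 'B', 'B', 'B'],
    'Col2': [10, 20, 30, 40, 50],
    'Col3': ['x', 'y', 'z', 'z', 'z'],
})

# Compute both per-group aggregates in one pass (named aggregation).
agg = df.groupby('Col1').agg(
    n_rows=('Col2', 'size'),         # rows per group
    n_distinct=('Col3', 'nunique'),  # distinct Col3 values per group
)

# Multiply the two aggregates and turn the index back into a column.
df_out = (agg['n_rows'] * agg['n_distinct']).rename('agg_col').reset_index()
print(df_out)
```

This keeps the intermediate counts around in agg, which can be handy for checking the product.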

2 Comments

Thanks, both of you. The only reason I am choosing @nick-braunagel's answer is that he answered first and has lower reputation, so this helps increase it :)
@apostolos. Great! Glad it worked for you. Thanks for the upvote.
