
Let's say I have:

df = pd.DataFrame({'a' : [1, 2, 3, 4, 5] , 'b' : ['cat_1', 'cat_1', 'cat_2', 'cat_2', 'cat_2']})

I perform a groupby:

df.groupby(['b']).agg(['count', 'median'])

I would like to iterate through the rows that this call returns, for example:

for row in ?:
    print(row)

should print something like:

('cat_1', 2, 1.5)
('cat_2', 3, 4)

2 Answers

You've misunderstood: df.groupby(['b']).agg(['count', 'median']) returns an in-memory dataframe, not an iterator of groupwise results.

The result you describe is usually computed by selecting the column before aggregating:

res = df.groupby('b')['a'].agg(['count', 'median'])

print(res)

#        count  median
# b                   
# cat_1      2     1.5
# cat_2      3     4.0

Iterating a dataframe is possible via iterrows or, more efficiently, itertuples:

for row in df.groupby('b')['a'].agg(['count', 'median']).itertuples():
    print((row.Index, row.count, row.median))

# ('cat_1', 2, 1.5)
# ('cat_2', 3, 4.0)
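
For comparison, a sketch of the iterrows version of the same loop. iterrows yields (index label, row) pairs where each row is a Series; because the row mixes the integer count with the float median, the count comes back upcast to a float:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': ['cat_1', 'cat_1', 'cat_2', 'cat_2', 'cat_2']})
res = df.groupby('b')['a'].agg(['count', 'median'])

# iterrows yields (index_label, row_as_Series) pairs;
# each row Series is upcast to float64 to hold both columns
for idx, row in res.iterrows():
    print((idx, row['count'], row['median']))
```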

If you are looking to calculate lazily, iterate a groupby object and perform your calculations on each group independently. For data that fits comfortably in memory, you should expect this to be slower than iterating a dataframe of results.

for key, group in df.groupby('b'):
    print((key, group['a'].count(), group['a'].median()))

# ('cat_1', 2, 1.5)
# ('cat_2', 3, 4.0)
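
That per-group loop can also be wrapped in a generator so results are produced lazily, one group at a time (a sketch; `group_stats` is a made-up helper name):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': ['cat_1', 'cat_1', 'cat_2', 'cat_2', 'cat_2']})

def group_stats(frame, by, col):
    """Yield (key, count, median) for each group, computed on demand."""
    for key, group in frame.groupby(by):
        yield (key, group[col].count(), group[col].median())

for row in group_stats(df, 'b', 'a'):
    print(row)
```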

If you do face memory issues, consider dask.dataframe for such tasks.


1 Comment

Thanks for the help!

This will do the trick:

for item in df.groupby(['b']).agg(['count', 'median']).reset_index().values:
    # Perform operation on 'item' ...
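
Expanded slightly, assuming the same df as in the question: `reset_index()` turns the group key back into an ordinary column, and `.values` then yields each row as a plain NumPy array (object dtype here, since it mixes the string key with numbers), so column labels are lost:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': ['cat_1', 'cat_1', 'cat_2', 'cat_2', 'cat_2']})

# Each 'item' is a plain NumPy array: [key, count, median]
for item in df.groupby(['b']).agg(['count', 'median']).reset_index().values:
    key, count, median = item
    print((key, count, median))
```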

