\$\begingroup\$

I am trying to apply a function to each group in a pandas DataFrame, where the function needs access to the entire group (as opposed to just one row). To do this, I iterate over each group in the groupby object and concatenate the results. Is this the best way to achieve this?

import pandas as pd
df = pd.DataFrame({'id': [1,1,1,1,2,2,2], 
                   'value': [70,10,20,100,50,5,33], 
                   'other_value': [2.3, 3.3, 7.4, 1.1, 5, 10.3, 12]})
def clean_df(df, v_col, other_col):
    '''This function is just a made up example and might 
       get more complex in real life. ;)
    '''
    prev_points = df[v_col].shift(1)
    next_points = df[v_col].shift(-1)
    return df[(prev_points > 50) | (next_points < 20)]  
grouped = df.groupby('id')
pd.concat([clean_df(group, 'value', 'other_value') for _, group in grouped])

The original dataframe is

    id  other_value value
0   1   2.3         70
1   1   3.3         10
2   1   7.4         20
3   1   1.1         100
4   2   5.0         50
5   2   10.3        5
6   2   12.0        33

The code will reduce it to

    id  other_value value
0   1   2.3         70
1   1   3.3         10
4   2   5.0         50
\$\endgroup\$

1 Answer

\$\begingroup\$

You can call apply directly on the grouped DataFrame; the function is then passed each whole group in turn:

def clean_df(df, v_col='value', other_col='other_value'):
    '''This function is just a made up example and might 
       get more complex in real life. ;)
    '''
    prev_points = df[v_col].shift(1)
    next_points = df[v_col].shift(-1)
    return df[(prev_points > 50) | (next_points < 20)]  

df.groupby('id').apply(clean_df).reset_index(level=0, drop=True)
#    id  other_value  value
# 0   1          2.3     70
# 1   1          3.3     10
# 4   2          5.0     50
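
The trailing reset_index(level=0, drop=True) is there to drop the group key that apply adds as an extra index level. A minimal sketch of an alternative, assuming your pandas version supports the group_keys argument of groupby: pass group_keys=False so the key is never added in the first place. The data and function are repeated so the snippet runs on its own. Depending on the pandas version, the id column may or may not be part of each group and of the result, so it is worth checking the result columns on your installation.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2],
                   'value': [70, 10, 20, 100, 50, 5, 33],
                   'other_value': [2.3, 3.3, 7.4, 1.1, 5, 10.3, 12]})

def clean_df(df, v_col='value', other_col='other_value'):
    # Same made-up filter as above; other_col is unused here.
    prev_points = df[v_col].shift(1)
    next_points = df[v_col].shift(-1)
    return df[(prev_points > 50) | (next_points < 20)]

# group_keys=False: do not prepend the group key to the result index,
# so no reset_index call is needed afterwards.
result = df.groupby('id', group_keys=False).apply(clean_df)
print(result.index.tolist())  # [0, 1, 4]
```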

Note that I gave the other arguments default values so that the function can be called with the group as its only argument. Another way to bind those arguments is to make a function that returns the function:

def clean_df(v_col, other_col):
    '''This function is just a made up example and might 
       get more complex in real life. ;)
    '''
    def wrapper(df):
        prev_points = df[v_col].shift(1)
        next_points = df[v_col].shift(-1)
        return df[(prev_points > 50) | (next_points < 20)]  
    return wrapper

Which you can use like this:

df.groupby('id').apply(clean_df('value', 'other_value')).reset_index(level=0, drop=True)
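
As a further sketch that is not part of the code above: GroupBy.apply also forwards any extra positional and keyword arguments to the function it calls, so the three-argument clean_df from the question can be passed as is. The group_keys=False argument is an addition here to keep the original row index; the data is repeated so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2],
                   'value': [70, 10, 20, 100, 50, 5, 33],
                   'other_value': [2.3, 3.3, 7.4, 1.1, 5, 10.3, 12]})

def clean_df(df, v_col, other_col):
    # Three-argument version from the question; other_col is unused here.
    prev_points = df[v_col].shift(1)
    next_points = df[v_col].shift(-1)
    return df[(prev_points > 50) | (next_points < 20)]

# Everything after the function is handed on to clean_df for each group:
# 'value' becomes v_col, and other_col is passed by keyword.
result = df.groupby('id', group_keys=False).apply(
    clean_df, 'value', other_col='other_value')
print(result['value'].tolist())  # [70, 10, 50]
```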

Or you can use functools.partial with your clean_df:

from functools import partial

df.groupby('id') \
  .apply(partial(clean_df, v_col='value', other_col='other_value')) \
  .reset_index(level=0, drop=True)
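
For this particular made-up example, a per-group function may not be needed at all. The sketch below is an alternative, not something taken from the code above: it assumes the real logic only needs values shifted within each group, computes them with a grouped shift, and then filters the whole frame once. If the real function is more complex, this may not apply.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2],
                   'value': [70, 10, 20, 100, 50, 5, 33],
                   'other_value': [2.3, 3.3, 7.4, 1.1, 5, 10.3, 12]})

values = df.groupby('id')['value']
prev_points = values.shift(1)    # previous value within the same id
next_points = values.shift(-1)   # next value within the same id

# Both shifted Series keep the original index, so they can be used
# directly as a boolean mask on the full frame.
result = df[(prev_points > 50) | (next_points < 20)]
print(result.index.tolist())  # [0, 1, 4]
```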
\$\endgroup\$
