Groupby and apply function to sub-dataframes in Python

Question

How can I group by item and date, then for each sub-dataframe get the row whose 'data' value is the middle (median) value of that sub-dataframe?

Sometimes multiple rows have data equal to the middle value; in that case, only the first row should be kept.

df:

    item   date        data
0   22     2012-03-10  10
1   22     2012-03-10  20
2   22     2012-03-10  40
3   24     2012-03-11  40
4   24     2012-03-11  50
5   24     2012-03-11  50

expected output:

1   22     2012-03-10  20
4   24     2012-03-11  50

Quang Hoang · Accepted Answer · 2020-07-08 04:35:37Z

2

You can use groupby().transform() and then boolean indexing:

medians = df.groupby(['item','date'])['data'].transform('median')

# drop duplicates in case
# multiple rows equal the median
df[df['data']==medians].drop_duplicates(['item','date','data'])

Output:

   item        date  data
1    22  2012-03-10    20
4    24  2012-03-11    50
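
For anyone who wants to run this end to end, here is a minimal self-contained sketch that rebuilds the sample frame from the question and applies the same transform-then-filter steps. The names `mask` and `result` are introduced here for illustration and are not part of the answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "item": [22, 22, 22, 24, 24, 24],
    "date": ["2012-03-10"] * 3 + ["2012-03-11"] * 3,
    "data": [10, 20, 40, 40, 50, 50],
})

# Median of each row's (item, date) group, aligned to df's index
medians = df.groupby(["item", "date"])["data"].transform("median")

# Keep rows whose value equals the group median,
# then keep only the first such row per group
mask = df["data"] == medians
result = df[mask].drop_duplicates(["item", "date", "data"])

print(result)
```

One caveat: for a group with an even number of rows, the median is the mean of the two middle values, so it may not equal any row's value and that group would then be missing from the result.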

edited Jul 8, 2020 at 4:35

answered Jul 8, 2020 at 4:28

Quang Hoang


2 Comments

nilsinelabore Over a year ago

Thanks for your answer. When I ran the code it raised AttributeError: 'DataFrame' object has no attribute 'name'. Do you know what might have gone wrong?

Quang Hoang Over a year ago

Neither your sample data nor my code contains the word name. You should look through your actual code for df.name and check what you were trying to do there.

wwnde · Accepted Answer · 2020-07-08 04:30:46Z

1

Use .groupby() followed by .agg('median'):

df[['item', 'date', 'data']].groupby(['date', 'item']).agg('median').reset_index()

        date  item  data
0  2012-03-10    22    20
1  2012-03-11    24    50

answered Jul 8, 2020 at 4:30

wwnde


2 Comments

nilsinelabore Over a year ago

Thank you for the solution. The output returned however includes 2 identical columns of data. Besides, if I have other additional columns in the dataframe, such as the timestamp, is it possible to keep the original values in these columns?

wwnde Over a year ago

If the columns are identical, that comes from your dataframe or possibly from other operations on it. Looking at the data sample you provided and my output, I cannot see any duplicated columns.
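
On the question of keeping additional columns: one option, sketched here with a hypothetical `timestamp` column that is not in the original sample, is to select whole rows with a boolean mask instead of aggregating. Selecting rows keeps every column and the original index, whereas an aggregation such as `.agg('median')` builds new rows from the grouped values.

```python
import pandas as pd

df = pd.DataFrame({
    "item": [22, 22, 22, 24, 24, 24],
    "date": ["2012-03-10"] * 3 + ["2012-03-11"] * 3,
    "data": [10, 20, 40, 40, 50, 50],
    # hypothetical extra column, added for illustration only
    "timestamp": ["08:00", "09:00", "10:00", "11:00", "12:00", "13:00"],
})

keys = ["item", "date"]
medians = df.groupby(keys)["data"].transform("median")

# Selecting rows keeps all columns; dropping duplicates on the keys
# keeps only the first matching row in each group
rows = df[df["data"] == medians].drop_duplicates(keys)

print(rows)
```

Each surviving row carries its own `timestamp`, because nothing was recomputed per column.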

Sonu · Accepted Answer · 2020-07-08 04:52:46Z

1

You can use the following as a sample, using pandas:

import pandas as pd

df['date'] = pd.to_datetime(df['date']).dt.date

# group by the key columns and take the median of 'data'
df1 = df.groupby(['item', 'date'])[['data']].median()

df1

answered Jul 8, 2020 at 4:52

Sonu


Vinod Karantothu · Accepted Answer · 2020-07-08 05:23:29Z

1

Try this:

df.groupby(['item', 'date'], as_index=False).median()

Output:

   item        date  data
0    22  2012-03-10    20
1    24  2012-03-11    50

answered Jul 8, 2020 at 5:23

Vinod Karantothu


6 Comments

nilsinelabore Over a year ago

Thank you for the answer. There are actually other columns in the dataframe, is it possible that I keep all of them?

Vinod Karantothu Over a year ago

df.groupby(['item', 'date'], as_index=False).agg({'data':'median', 'other_col1':'first', 'other_col2':'first'}). You need to specify the function that decides how each column's value is picked, for example 'sum', 'first', or 'last'.

nilsinelabore Over a year ago

When specifying a column as "first", does the row have to comply with the other columns first? If not, what is the order?

Vinod Karantothu Over a year ago

It doesn't need to, since you are grouping the rows only by the columns 'item' and 'date'. For the remaining columns there is no obvious value to pick, which is why they don't show up unless you specify one. You need to state whether the sum/median/min/max/first/last value should be picked.

nilsinelabore Over a year ago

So if I used "first" for an additional column, say "value", with everything else unchanged, is it choosing the first row after grouping by ["item", "date"], or the first row after grouping by ["item", "date"] AND where the data column is the middle value of the group?
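
This can be checked directly. In the sketch below, which assumes a hypothetical `value` column, each aggregation is computed per column within its (item, date) group, so `'first'` does not depend on where the median of `data` falls.

```python
import pandas as pd

df = pd.DataFrame({
    "item": [22, 22, 22, 24, 24, 24],
    "date": ["2012-03-10"] * 3 + ["2012-03-11"] * 3,
    "data": [10, 20, 40, 40, 50, 50],
    # hypothetical extra column, added for illustration only
    "value": ["a", "b", "c", "d", "e", "f"],
})

# 'median' and 'first' are applied to their columns independently
agg = df.groupby(["item", "date"], as_index=False).agg(
    {"data": "median", "value": "first"}
)

print(agg)
```

For item 22 the row holding the median has `value == 'b'`, yet `'first'` returns `'a'`, the first entry of the group. The aggregated row can therefore mix values from different original rows; to keep values from the median row itself, select rows with a boolean mask rather than aggregating.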
