1

How can I group by item and date and then, for each sub-dataframe, get the actual row whose 'data' value is the middle value of that sub-dataframe?

Sometimes multiple rows have data equal to the middle value; in that case, only the first such row should be kept.

df:

    item   date        data
0   22     2012-03-10  10
1   22     2012-03-10  20
2   22     2012-03-10  40
3   24     2012-03-11  40
4   24     2012-03-11  50
5   24     2012-03-11  50

expected output:

1   22     2012-03-10  20
4   24     2012-03-11  50

4 Answers

2

You can use groupby().transform() and then boolean indexing:

medians = df.groupby(['item','date'])['data'].transform('median')

# drop duplicates in case several rows
# in a group are equal to the median (keeps the first)
df[df['data'] == medians].drop_duplicates(['item', 'date', 'data'])

Output:

   item        date  data
1    22  2012-03-10    20
4    24  2012-03-11    50
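One caveat with comparing against the median: for a group with an even number of rows, the median is the average of the two middle values and may not match any row. The following is a minimal sketch of a position-based alternative, not part of the original answer. It assumes the lower of the two middle rows is wanted for even-sized groups, and the names `ordered`, `pos`, `size`, and `middle` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "item": [22, 22, 22, 24, 24, 24],
    "date": ["2012-03-10"] * 3 + ["2012-03-11"] * 3,
    "data": [10, 20, 40, 40, 50, 50],
})

# Stable sort on a single column, so tied values keep their original order
ordered = df.sort_values("data", kind="stable")

grouped = ordered.groupby(["item", "date"])
pos = grouped.cumcount()                   # 0-based rank of each row within its group
size = grouped["data"].transform("size")   # group size, aligned to each row

# Assumption: for even-sized groups, take the lower of the two middle rows
middle = ordered[pos == (size - 1) // 2].sort_index()
print(middle)
```

Because the rows are selected rather than aggregated, any additional columns in the dataframe are carried along unchanged.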

2 Comments

Thanks for your answer. Running the code raised AttributeError: 'DataFrame' object has no attribute 'name'. Do you know what might have gone wrong?
Neither your sample data nor my code contains the word name. You should look for df.name in your actual code and see what you were trying to do there.
1

Use .groupby() and .agg('median'):

df[['item', 'date', 'data']].groupby(['date', 'item']).agg('median').reset_index()

        date  item  data
0  2012-03-10    22    20
1  2012-03-11    24    50

2 Comments

Thank you for the solution. However, the output includes two identical data columns. Also, if I have additional columns in the dataframe, such as a timestamp, is it possible to keep the original values in those columns?
If the columns are identical, that comes from your dataframe or from other operations you ran. Given the data sample you provided and my output, I cannot see any duplicated columns.
1

You can use the following as a sample with pandas:

df['date'] = pd.to_datetime(df['date']).dt.date

# group by item and date, then take the median of data
df1 = df.groupby(['item', 'date'])['data'].median()

df1

Comments

1

Try this:

df.groupby(['item', 'date'], as_index=False).median()

Output:

   item        date  data
0    22  2012-03-10    20
1    24  2012-03-11    50

6 Comments

Thank you for the answer. There are actually other columns in the dataframe, is it possible that I keep all of them?
Use df.groupby(['item', 'date'], as_index=False).agg({'data':'median', 'other_col1':'first', 'other_col2':'first'}). For each extra column you need to specify the function that picks its value, for example 'sum', 'first', or 'last'.
When specifying a column as "first", does the row have to comply with the other columns first? If not, what is the order?
It doesn't need to, since you are grouping the rows only by the columns 'item' and 'date'. For the remaining columns it is ambiguous which value to pick, which is why they don't show up unless you specify one. You have to state whether you want the sum, median, min, max, first, or last value.
So if I had "first" for an additional column say "value", other factors remain constant, is it choosing the first row after group by ["item", "date"] or is it choosing the first row after group by ["item", "date"], AND where the data column is the middle value of the group?
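A small sketch may help with this last question. It is not from the original thread, and the extra column `value` and its contents are hypothetical. In pandas, 'first' returns each group's first non-null entry of that column in row order, independently of which row holds the median of data. To take `value` from the median row itself, select rows instead of aggregating.

```python
import pandas as pd

df = pd.DataFrame({
    "item": [22, 22, 22],
    "date": ["2012-03-10"] * 3,
    "data": [10, 20, 40],
    "value": ["a", "b", "c"],   # hypothetical extra column
})

# Each column is aggregated independently: 'first' takes the group's
# first non-null entry, regardless of where the median of data sits.
agg = df.groupby(["item", "date"], as_index=False).agg(
    {"data": "median", "value": "first"}
)

# Selecting the row whose data equals the group median keeps that row's value.
medians = df.groupby(["item", "date"])["data"].transform("median")
rows = df[df["data"] == medians].drop_duplicates(["item", "date"])

print(agg)
print(rows)
```

Here `agg` reports the median of data next to the value taken from the group's first row, while `rows` keeps the value that belongs to the median row.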