1

I have the following dataframe and I want to sort by two columns - first by id and those then need to be sorted in date order, both in ascending order, with the earliest date first:

df = pd.DataFrame(
    {
        "id": [9733, 9733, 9733, 9733, 9733, 9733, 9733, 9733, 2949, 2949, 2949, 2949, 2949, 2949, 2949, 9765, 9765, 9765, 9765, 9765, 9765],
        "date": ["01/05/2008", "01/06/1968", "01/11/2010", "01/12/2016", "09/03/2011", "09/05/1975", "28/04/2011", "29/01/2005", "08/07/1974", "09/03/2021", "10/03/2021", "18/09/1986", "20/07/2021", "23/09/2017", "26/06/2020", "01/04/1963", "01/08/2012", "02/08/2012", "25/06/2021", "30/11/2020", "31/03/1986",],
        "status": ["S", "A", "P", "C", "S", "D", "P", "P", "A", "P", "S", "D", "P", "P", "S", "A", "S", "P", "P", "S", "P"],
    }
)

First, I change the date column to datetime and apply the dd/mm/yyyy format:

df['date'] = pd.to_datetime(df['date'])
df['date'] = df['date'].dt.strftime('%d/%m/%Y')

Then, I tried using the sort_values function:

df.sort_values(by=['id','date'], ascending=[True,True], inplace=True)

I'm getting the id column sorted in ascending order. However, the date column isn't in ascending order and this is where I can't work out where I'm going wrong:

enter image description here

Ultimately, I want the dataframe sorted by the earliest P status only (I want A status and D status separately which I figure I use .filter to get that, each will be output to an Excel file in separate worksheets), but I'm stuck with the ordering of the id and dates.

Thanks for your help.

1
  • you only want to sort dataframe where only P in the column status ? Commented Sep 20, 2022 at 21:50

2 Answers 2

2

If you convert it to a string, the date will be sorted based on alphabetic/lexicographic order. so your data is sorted correctly, given the fact that you specify a string with dd/mm/yyyy format.

If you don't want it sorted in that order, then sort the data first, then convert it to a string:

df['date'] = pd.to_datetime(df['date'])
df.sort_values(by=['id','date'], ascending=[True,True], inplace=True)
df['date'] = df['date'].dt.strftime('%d/%m/%Y')

Alternatively, to numerically sort the string-formatted dates, you can provide a custom key argument to pd.sort_values:

In [8]: df.sort_values(
   ...:     by=['id','date'],
   ...:     ascending=[True,True],
   ...:     inplace=True,
   ...:     key=lambda col: (
   ...:         pd.to_datetime(col, format="%d/%m/%Y")
   ...:         if col.name == "date"
   ...:         else col
   ...:     ),
   ...: )

In [9]: df
Out[9]:
      id        date status
8   2949  08/07/1974      A
11  2949  18/09/1986      D
13  2949  23/09/2017      P
14  2949  26/06/2020      S
9   2949  09/03/2021      P
10  2949  10/03/2021      S
12  2949  20/07/2021      P
1   9733  01/06/1968      A
5   9733  09/05/1975      D
7   9733  29/01/2005      P
0   9733  01/05/2008      S
2   9733  01/11/2010      P
4   9733  09/03/2011      S
6   9733  28/04/2011      P
3   9733  01/12/2016      C
15  9765  01/04/1963      A
20  9765  31/03/1986      P
16  9765  01/08/2012      S
17  9765  02/08/2012      P
19  9765  30/11/2020      S
18  9765  25/06/2021      P
Sign up to request clarification or add additional context in comments.

2 Comments

Brilliant, thank you Michael! Never thought about the key parameter. Learned a lot from your answer and I appreciate it very much. I have learned about Pandas on some courses (mostly Udemy) but I always seem to struggle in 'real world' issues. Everything I searched on Google about sort_values mentioned the ascending part, but none mentioned the key parameter! Is there some course you can recommend so that I can become as awesome as someone like you? Seems like there are some great developers on Stackoverflow and it'd be good to know where to tap into to get to your level of Pandas knowledge :)
Wow thanks! Glad it helped. Hanging out here is a good strategy, as is always assuming there is a right way to do things and that if you have an issue, someone else has probably already solved it and you just need to find it. That holds true 99.9% of the time for me, and leads to some pretty great finds in the docs and on this site. You’ll get there!
0

if I understand well you question, you need to sort only when the column status has value P:

#1 convert the column date to datetime
df['date'] = pd.to_datetime(df['date'])

#2 split the dataframe to 2 dataframes
df1 = df.loc[df['status'].eq('P')].sort_values(by=['id','date'], ascending=[True,True])
df2 = df.loc[df['status'].ne('P')]

df3 = pd.concat([df1, df2], axis=0, ignore_index=True) #this is your new df 

1 Comment

based on "Ultimately ..., but I'm stuck with the ordering of the id and dates", to me it seems the question focuses on sorting dates, which in the picture are numerically (but not alphabetically) out of order. Filtering based on P is a next step which the OP knows how to deal with.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.