2

I often need to work with dataframes whose row names (the index in pandas) have the form [a-z][0-9]+ (e.g. 'a20', 'a21', ... or 'b1', 'b11', 'b2', ...).

I would like to be able to sort the rows based on the number alone. From what I can gather of the API, df.sort_index() does not take a function argument where I can define the sorting function.

Is there a neat way of achieving this without resorting to removing the initial characters when creating the dataframe?
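For readers on newer releases: since pandas 1.1, sort_index does accept a key argument, which is called with the whole Index and should return an Index of the same length. The following is a minimal sketch under that version assumption; the helper name numeric_part is hypothetical, and it assumes every label contains at least one digit.

```python
import pandas as pd

df = pd.DataFrame({'v': [1, 2, 3]}, index=['b1', 'b11', 'b2'])

def numeric_part(index):
    # Hypothetical helper: pull the first run of digits out of each label
    # and return it as an integer Index of the same length.
    return index.str.extract(r'(\d+)', expand=False).astype(int)

# Requires pandas >= 1.1 (the version that added the `key` parameter).
result = df.sort_index(key=numeric_part)
print(result)
```

The labels themselves are left untouched; only the sort order is derived from the numeric part.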

4
  • Perhaps you can use reindex. Commented Feb 5, 2018 at 16:18
  • I think the current best approach is to construct a column from the index with the letter prefixes removed, and sort by that. Commented Feb 5, 2018 at 16:21
  • I feel like the best theoretical way to do this is to have a multi-index of one string column and one integer column, then you can use standard pandas sorting and features. However, one of the current answers might very well be a better practical answer since multi-indexes introduce more complexity. Commented Feb 5, 2018 at 16:48
  • I agree with @HaleemurAli, but only if the numeric part of the index is a unique identifier. I.e. how do you want to deal with A25 & B25 (if that is possible). Commented Feb 5, 2018 at 16:56
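The multi-index idea from the comments above can be sketched roughly as follows. This is an illustration only: the level names prefix and number and the sample data are made up, and the regex assumes every label is lowercase letters followed by digits. Sorting on both levels also gives a defined order for labels such as a2 and b2 that share a number.

```python
import pandas as pd

df = pd.DataFrame({'v': range(5)}, index=['a12', 'b2', 'a2', 'b11', 'a5'])

# Split each label into its letter prefix and its numeric suffix.
parts = df.index.str.extract(r'^([a-z]+)(\d+)$', expand=True)
parts.columns = ['prefix', 'number']          # hypothetical level names
parts['number'] = parts['number'].astype(int)

# Replace the string index with a two-level (prefix, number) index.
multi = df.copy()
multi.index = pd.MultiIndex.from_frame(parts)

# Sort by number first, then by prefix to break ties.
by_number = multi.sort_index(level=['number', 'prefix'])
print(by_number)
```

The cost is that every later lookup has to use tuples such as ('a', 12) instead of 'a12', which is the added complexity the comment warns about.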

3 Answers

4

You can use str.extract to pull out the digits, convert them to int, get the positions of the sorted values with argsort, and finally select the rows with iloc:

(borrowing data from @Dark):

df = df.iloc[df.index.str.extract(r'(\d+)', expand=False).astype(int).argsort()]
print(df)
     0
a5   5
a6   6
a10  2
a11  3
b11  7
a12  1
b12  8
a15  4

Explanation:

df = pd.DataFrame([1,2,3,4,5,6,7,8],index=['a12','a10','a11','a15','a5','a6','b11','b12'])
print (df)
     0
a12  1
a10  2
a11  3
a15  4
a5   5
a6   6
b11  7
b12  8

First extract all integers:

print(df.index.str.extract(r'(\d+)', expand=False))
Index(['12', '10', '11', '15', '5', '6', '11', '12'], dtype='object')

Cast to integers:

print(df.index.str.extract(r'(\d+)', expand=False).astype(int))
Int64Index([12, 10, 11, 15, 5, 6, 11, 12], dtype='int64')

Get positions of sorted values by argsort:

print(df.index.str.extract(r'(\d+)', expand=False).astype(int).argsort())
[4 5 1 2 6 0 7 3]

Finally, select the rows at those positions with iloc:

print(df.iloc[df.index.str.extract(r'(\d+)', expand=False).astype(int).argsort()])
     0
a5   5
a6   6
a10  2
a11  3
b11  7
a12  1
b12  8
a15  4

1 Comment

Could you elaborate a bit on what this does? It's cool that it seems to work but I'd like to be able to understand what it does :)
2

Use natsort to sort the index labels in natural order, then reindex:

df=pd.DataFrame({'v':[1,2,3]},index=['b1','b11','b2'])

from natsort import natsorted, ns


df.reindex(natsorted(df.index.tolist(), alg=ns.IGNORECASE))

Out[795]: 
     v
b1   1
b2   3
b11  2

2 Comments

You were a bit quicker
@Dark yep , just a little bit :-)
2

You can use sorted and reindex, i.e.:

df = pd.DataFrame([1,2,3,4,5,6,7,8],index=['a12','a10','a11','a15','a5','a6','b11','b12'])

import re
df.reindex(sorted(df.index, key=lambda x: re.sub('[A-Za-z]', '', x)))


     0
a10  2
a11  3
b11  7
a12  1
b12  8
a15  4
a5   5
a6   6

1 Comment

I think you need to convert to int :)
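To illustrate the comment: the key in this answer returns strings, so '10' sorts before '5'. A minimal sketch of the same sorted/reindex approach with an integer key might look like this. The helper name numeric_suffix is hypothetical, and the sketch assumes the labels are unique (reindex does not accept duplicate labels) and consist only of ASCII letters followed by digits.

```python
import re
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8],
                  index=['a12', 'a10', 'a11', 'a15', 'a5', 'a6', 'b11', 'b12'])

def numeric_suffix(label):
    # Hypothetical helper: strip the letters and compare what is left
    # as an integer rather than as a string.
    return int(re.sub(r'[A-Za-z]', '', label))

# sorted() is stable, so labels with the same number keep their original order.
result = df.reindex(sorted(df.index, key=numeric_suffix))
print(result)
```

With the integer key the order becomes a5, a6, a10, a11, b11, a12, b12, a15, matching the output of the extract/argsort answer.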
