I'm trying to calculate the percentile of each number within a dataframe and add it to a new column called 'percentile'.

This is my attempt:

import pandas as pd
from scipy import stats

data = {'symbol':'FB','date':['2012-05-18','2012-05-21','2012-05-22','2012-05-23'],'close':[38.23,34.03,31.00,32.00]}

df = pd.DataFrame(data)

close = df['close']

for i in df:
    df['percentile'] = stats.percentileofscore(close,df['close'])

The new column is filled with NaN instead of the percentiles. This should be fairly easy, but I'm not sure where I'm going wrong.

Thanks in advance for the help.

  • There is no need to loop with for i in df. See this answer: stackoverflow.com/a/44607827/1870832 (commented Jun 18, 2017 at 3:06)
  • You should read up on broadcasting in pandas. (commented Jun 18, 2017 at 3:16)

1 Answer

df.close.apply(lambda x: stats.percentileofscore(df.close.sort_values(),x))

or

df.close.rank(pct=True)

Output of rank(pct=True) (percentileofscore gives the same values on a 0-100 scale: 100, 75, 25, 50):

0    1.00
1    0.75
2    0.25
3    0.50
Name: close, dtype: float64
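
To get the result into the new 'percentile' column the question asks for, assign it directly, with no loop. The following is a minimal sketch using the question's data; the column name percentile_scipy is only an illustrative label, not something from the original post.

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    'symbol': 'FB',
    'date': ['2012-05-18', '2012-05-21', '2012-05-22', '2012-05-23'],
    'close': [38.23, 34.03, 31.00, 32.00],
})

# Vectorized: each value's rank divided by the number of rows (0-1 scale).
df['percentile'] = df['close'].rank(pct=True)

# Element-wise SciPy version: one scalar score per call, result on a 0-100 scale.
# 'percentile_scipy' is a hypothetical column name used only for comparison.
df['percentile_scipy'] = df['close'].apply(
    lambda x: stats.percentileofscore(df['close'].to_numpy(), x)
)

print(df)
```

Multiply the rank(pct=True) result by 100 if you want it on the same scale as percentileofscore.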

3 Comments

Very simple answer, thanks @scott-boston
Use .rank; it should be significantly faster.
.rank is 100% what you should use. That lambda function, while correct, will be MUCH slower.
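
A rough way to check the speed claim yourself is sketched below. The series size (1000), the random data, and the helper names by_rank and by_apply are assumptions made for illustration; actual timings depend on your machine and library versions, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical test data: 1000 random closing prices.
rng = np.random.default_rng(0)
close = pd.Series(rng.uniform(20, 40, size=1000), name='close')
values = close.to_numpy()

def by_rank():
    # One vectorized pass over the whole series.
    return close.rank(pct=True)

def by_apply():
    # One percentileofscore call per element, each scanning the full array.
    return close.apply(lambda x: stats.percentileofscore(values, x))

fast = by_rank()
slow = by_apply()

# Both approaches agree once the scales are aligned (0-1 vs 0-100).
same = np.allclose(fast, slow / 100)

print('results match:', same)
print('rank :', timeit.timeit(by_rank, number=1))
print('apply:', timeit.timeit(by_apply, number=1))
```

The apply version does work proportional to the square of the series length, which is why the gap widens as the data grows.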