The Data

I've got a dataframe that has rank scores for a given ID:

>>> ranks
  ID  rank
0  A     6
1  B     9
2  C     6
3  D     1
4  E     1
5  F     2

I would like to turn this into a square matrix with each ID as both an index and a column, filled according to these conditions: if the ID on the index ranks higher than the ID in the column (that is, its rank number is lower), set the cell to 1; if it ranks lower, set it to 0; if the two ranks are equal, set it to 0.5; and if the index is the same as the column, set it to np.nan. This is better described by looking at my desired matrix:

Desired Result

>>> mtrx
     A    B    C    D    E    F
A  NaN  1.0  0.5  0.0  0.0  0.0
B  0.0  NaN  0.0  0.0  0.0  0.0
C  0.5  1.0  NaN  0.0  0.0  0.0
D  1.0  1.0  1.0  NaN  0.5  1.0
E  1.0  1.0  1.0  0.5  NaN  1.0
F  1.0  1.0  1.0  0.0  0.0  NaN

What I've Done (works, but is slow)

The following loop works, but it is slow on larger dataframes. If someone can point me toward a nicer, more pythonic/pandorable way to achieve this, I'd love some help:

# Make an empty matrix as a dataframe
mtrx = pd.DataFrame(np.zeros((len(IDs), len(IDs))), index=IDs, columns = IDs)

# Populate it via for loop
for i in IDs:
    for j in IDs:
        i_rank = ranks.loc[ranks['ID'] == i].iloc[0]['rank']
        j_rank = ranks.loc[ranks['ID'] == j].iloc[0]['rank']
        if i == j:
            mtrx.loc[i, j] = np.nan
        elif i_rank < j_rank:
            mtrx.loc[i, j] = 1.
        elif i_rank == j_rank:
            mtrx.loc[i, j] = 0.5

Code to reproduce this toy example

import pandas as pd
import numpy as np
np.random.seed(1)
IDs = list('ABCDEF')
ranks = pd.DataFrame({'ID':IDs, 'rank':np.random.randint(1,10,len(IDs))})

1 Answer

numpy approach

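s = ranks['rank'].values
# s1[i, j] is 1.0 where the row ID's rank number is lower than the column ID's
s1 = (s > s[:, None]).astype(float)
s1[s == s[:, None]] = 0.5      # tied ranks
np.fill_diagonal(s1, np.nan)   # an ID compared with itself
pd.DataFrame(s1, index=ranks.ID, columns=ranks.ID)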


Out[843]: 
ID    A    B    C    D    E    F
ID                              
A   NaN  1.0  0.5  0.0  0.0  0.0
B   0.0  NaN  0.0  0.0  0.0  0.0
C   0.5  1.0  NaN  0.0  0.0  0.0
D   1.0  1.0  1.0  NaN  0.5  1.0
E   1.0  1.0  1.0  0.5  NaN  1.0
F   1.0  1.0  1.0  0.0  0.0  NaN
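
As a rough sketch of the same broadcasting idea, the three cases can also be derived from the sign of the pairwise rank difference. The helper name `pairwise_rank_matrix` is made up for illustration, and the rank values are hard-coded to the ones shown in the question rather than regenerated from the seed:

```python
import numpy as np
import pandas as pd

def pairwise_rank_matrix(ranks):
    """Hypothetical helper: build the square comparison matrix from a
    frame with 'ID' and 'rank' columns."""
    r = ranks['rank'].to_numpy()
    # diff[i, j] = rank of column ID j minus rank of row ID i
    diff = r[None, :] - r[:, None]
    # sign is -1, 0 or +1; shift and scale it to 0.0, 0.5 or 1.0
    out = (np.sign(diff) + 1) / 2.0
    # an ID compared with itself is undefined
    np.fill_diagonal(out, np.nan)
    ids = ranks['ID'].to_numpy()
    return pd.DataFrame(out, index=ids, columns=ids)

# Rank values copied from the question's printed dataframe
ranks = pd.DataFrame({'ID': list('ABCDEF'), 'rank': [6, 9, 6, 1, 1, 2]})
mtrx = pairwise_rank_matrix(ranks)
print(mtrx)
```

This builds the whole n x n array in memory at once, so it trades memory for speed; for very large numbers of IDs that may matter.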

pandas approach

s=ranks.assign(key=1).merge(ranks.assign(key=1),on='key')
s['New']=(s['rank_x']<s['rank_y']).astype(float)  # float, so 0.5 and NaN can be assigned below
s.loc[s['rank_x']==s['rank_y'],'New']=0.5
s.loc[s['ID_x']==s['ID_y'],'New']=np.nan

s.set_index(['ID_x','ID_y']).New.unstack()
Out[854]: 
ID_y    A    B    C    D    E    F
ID_x                              
A     NaN  1.0  0.5  0.0  0.0  0.0
B     0.0  NaN  0.0  0.0  0.0  0.0
C     0.5  1.0  NaN  0.0  0.0  0.0
D     1.0  1.0  1.0  NaN  0.5  1.0
E     1.0  1.0  1.0  0.5  NaN  1.0
F     1.0  1.0  1.0  0.0  0.0  NaN
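
On newer pandas (1.2 or later, as far as I know) the helper `key` column can probably be skipped by using `how='cross'`, and the conditions can be expressed in one place with `np.select`. This is only a sketch: it assumes the IDs are unique, and the rank values are hard-coded from the question's printed dataframe:

```python
import numpy as np
import pandas as pd

# Rank values copied from the question's printed dataframe
ranks = pd.DataFrame({'ID': list('ABCDEF'), 'rank': [6, 9, 6, 1, 1, 2]})

# Cartesian product of the frame with itself; overlapping column
# names get the _x (row side) and _y (column side) suffixes
pairs = ranks.merge(ranks, how='cross', suffixes=('_x', '_y'))

# Conditions are checked in order, so the same-ID case wins over the tie case
conditions = [
    pairs['ID_x'] == pairs['ID_y'],
    pairs['rank_x'] < pairs['rank_y'],
    pairs['rank_x'] == pairs['rank_y'],
]
choices = [np.nan, 1.0, 0.5]
pairs['New'] = np.select(conditions, choices, default=0.0)

# Reshape the long table into the square matrix (requires unique IDs)
result = pairs.pivot(index='ID_x', columns='ID_y', values='New')
print(result)
```

The order of `conditions` matters here, because `np.select` takes the first one that matches.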

2 Comments

Great! That's much faster. I'll wait a tiny bit to see whether there are any pandas solutions (just out of interest), but otherwise I'll accept this for sure. Thanks!
@sacul Yep, it should be (s['rank_x']<s['rank_y']). Sorry for the confusion :-)
