efficient way to populate pandas dataframe based on conditions from another dataframe

Question

The Data

I've got a dataframe that has rank scores for a given ID:

>>> ranks
  ID  rank
0  A     6
1  B     9
2  C     6
3  D     1
4  E     1
5  F     2

I would like to turn this into a square matrix with each ID as both an index and a column, filled according to these conditions: if the ID on the index outranks the ID in the column (that is, its rank number is lower, as in the loop below), set the cell to 1; if it is outranked, set it to 0; if the two ranks are equal, set it to 0.5; and if the index is the same as the column, set it to np.nan. This is better described by looking at my desired matrix:

Desired Result

>>> mtrx
     A    B    C    D    E    F
A  NaN  1.0  0.5  0.0  0.0  0.0
B  0.0  NaN  0.0  0.0  0.0  0.0
C  0.5  1.0  NaN  0.0  0.0  0.0
D  1.0  1.0  1.0  NaN  0.5  1.0
E  1.0  1.0  1.0  0.5  NaN  1.0
F  1.0  1.0  1.0  0.0  0.0  NaN

What I've Done (works, but is slow)

The following loop works, but it is slow with larger dataframes. If someone can point me toward a nicer, more pythonic/pandorable way to achieve this, I'd love some help:

# Make an empty matrix as a dataframe
mtrx = pd.DataFrame(np.zeros((len(IDs), len(IDs))), index=IDs, columns = IDs)

# Populate it via for loop
for i in IDs:
    for j in IDs:
        i_rank = ranks.loc[ranks['ID'] == i].iloc[0]['rank']
        j_rank = ranks.loc[ranks['ID'] == j].iloc[0]['rank']
        if i == j:
            mtrx.loc[i, j] = np.nan
        elif i_rank < j_rank:
            mtrx.loc[i, j] = 1.
        elif i_rank == j_rank:
            mtrx.loc[i, j] = 0.5

Code to reproduce this toy example

import pandas as pd
import numpy as np
np.random.seed(1)
IDs = list('ABCDEF')
ranks = pd.DataFrame({'ID':IDs, 'rank':np.random.randint(1,10,len(IDs))})

sacuL · Accepted Answer · 2018-02-05 18:15:01Z

numpy approach

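s = ranks['rank'].values
# s[:, None] is a column of row ranks; s broadcasts across the columns
s1 = (s > s[:, None]).astype(float)   # 1.0 where the row ID's rank number is lower than the column ID's
s1[s == s[:, None]] = 0.5             # tied ranks
np.fill_diagonal(s1, np.nan)          # an ID compared with itself
pd.DataFrame(s1, index=ranks.ID, columns=ranks.ID)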


Out[843]: 
ID    A    B    C    D    E    F
ID                              
A   NaN  1.0  0.5  0.0  0.0  0.0
B   0.0  NaN  0.0  0.0  0.0  0.0
C   0.5  1.0  NaN  0.0  0.0  0.0
D   1.0  1.0  1.0  NaN  0.5  1.0
E   1.0  1.0  1.0  0.5  NaN  1.0
F   1.0  1.0  1.0  0.0  0.0  NaN
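
As a variant on the broadcasting idea above (a sketch, not part of the original answer): because the three outcomes 0, 0.5 and 1 are evenly spaced, they can be derived from the sign of the rank difference in a single expression. The helper name `pairwise_matrix` is hypothetical, and the rank values are copied from the toy example rather than regenerated from the seed.

```python
import numpy as np
import pandas as pd

def pairwise_matrix(ranks):
    """Hypothetical helper: 1.0 where the row ID's rank number is smaller
    than the column ID's, 0.5 on ties, 0.0 otherwise, NaN on the diagonal."""
    s = ranks['rank'].to_numpy()
    ids = ranks['ID'].to_numpy()
    # diff[i, j] = rank of column j minus rank of row i
    diff = s[None, :] - s[:, None]
    # np.sign gives -1, 0 or 1, so (sign + 1) / 2 maps to 0.0, 0.5 or 1.0
    out = (np.sign(diff) + 1) / 2
    np.fill_diagonal(out, np.nan)
    return pd.DataFrame(out, index=ids, columns=ids)

# Rank values copied from the toy example in the question
ranks = pd.DataFrame({'ID': list('ABCDEF'), 'rank': [6, 9, 6, 1, 1, 2]})
mtrx = pairwise_matrix(ranks)
print(mtrx)
```

This builds one n-by-n array of differences, so memory grows with the square of the number of IDs, the same as the approach above.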

pandas approach

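# Cross join: the constant key pairs every row with every row
s=ranks.assign(key=1).merge(ranks.assign(key=1),on='key')
# Start from floats so 0.5 and NaN can be assigned without changing the dtype
s['New']=(s['rank_x']<s['rank_y']).astype(float)
s.loc[s['rank_x']==s['rank_y'],'New']=0.5
s.loc[s['ID_x']==s['ID_y'],'New']=np.nan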

s.set_index(['ID_x','ID_y']).New.unstack()
Out[854]: 
ID_y    A    B    C    D    E    F
ID_x                              
A     NaN  1.0  0.5  0.0  0.0  0.0
B     0.0  NaN  0.0  0.0  0.0  0.0
C     0.5  1.0  NaN  0.0  0.0  0.0
D     1.0  1.0  1.0  NaN  0.5  1.0
E     1.0  1.0  1.0  0.5  NaN  1.0
F     1.0  1.0  1.0  0.0  0.0  NaN
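
A possible variation on the pandas approach, offered as a sketch rather than as the answer's own method: `merge(how='cross')` (available in pandas 1.2 and later, which is an assumption about the installed version) removes the need for the helper `key` column, and `np.select` states all the rules in one place. The names `pairs` and `result` are illustrative, and the rank values are copied from the toy example.

```python
import numpy as np
import pandas as pd

# Rank values copied from the toy example in the question
ranks = pd.DataFrame({'ID': list('ABCDEF'), 'rank': [6, 9, 6, 1, 1, 2]})

# Every (row ID, column ID) pair; overlapping column names get _x / _y suffixes
pairs = ranks.merge(ranks, how='cross')

# np.select uses the first matching condition, so the same-ID check goes first
pairs['New'] = np.select(
    [pairs['ID_x'] == pairs['ID_y'],
     pairs['rank_x'] < pairs['rank_y'],
     pairs['rank_x'] == pairs['rank_y']],
    [np.nan, 1.0, 0.5],
    default=0.0,
)

# Reshape the long table of pairs into the square matrix
result = pairs.pivot(index='ID_x', columns='ID_y', values='New')
print(result)
```

Note that `pivot` sorts the row and column labels; that matches the original order here only because the IDs are already alphabetical.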

edited Feb 5, 2018 at 18:15

sacuL

answered Feb 5, 2018 at 17:11

BENY

2 Comments

sacuL Over a year ago

Great! That's much faster. I'll wait a tiny bit to see whether there are any pandas solutions (just out of interest), but otherwise I'll accept this for sure. Thanks!

BENY Over a year ago

@sacul yep, it should be (s['rank_x']<s['rank_y']); sorry for the confusion :-)
