
I have a dataframe in which all values are of the same variety (e.g. a correlation matrix -- but where we expect a unique maximum). I'd like to return the row and the column of the maximum of this matrix.

I can get the max across rows or columns by changing the first argument of

df.idxmax()

however I haven't found a suitable way to return the row/column index of the max of the whole dataframe.

For example, I can do this in numpy:

>>>npa = np.array([[1,2,3],[4,9,5],[6,7,8]])
>>>np.where(npa == np.amax(npa))
(array([1]), array([1]))

But when I try something similar in pandas:

>>>df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>>df.where(df == df.max().max())
    a   b   c
d NaN NaN NaN
e NaN   9 NaN
f NaN NaN NaN
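For reference, one route that does return both labels is to stack the frame into a Series with a (row, column) MultiIndex, where idxmax gives the pair directly (a sketch):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))

# stack() turns the frame into a Series keyed by (row, column) pairs,
# so idxmax() returns both labels at once
row, col = df.stack().idxmax()
print(row, col)  # e b
```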

At a second level, what I actually want to do is to return the rows and columns of the top n values, e.g. as a Series.

E.g. for the above I'd like a function which does:

>>>topn(df,3)
b e
c f
b f
dtype: object
>>>type(topn(df,3))
pandas.core.series.Series

or even just

>>>topn(df,3)
(['b','c','b'],['e','f','f'])

a la numpy.where()

  • Should topn(df,3) return (['b','c','a'],['e','f','f']) or (['b','c','b'],['e','f','f']) Commented Nov 12, 2014 at 9:07
  • the latter. df['a']['f'] = 6 while df['b']['f'] = 7 Commented Nov 12, 2014 at 9:35

3 Answers


I figured out the first part:

npa = df.to_numpy()   # as_matrix() was removed in pandas 1.0; .values also works
indx, cols = np.where(npa == np.amax(npa))   # np.where returns (rows, cols) in that order
([df.columns[c] for c in cols], [df.index[i] for i in indx])

Now I need a way to get the top n. One naive idea is to copy the array and iteratively replace the top values with NaN, grabbing indices as you go. That seems inefficient. Is there a better way to get the top n values of a numpy array? Fortunately, as shown here, there is: argpartition, though we have to use flattened indexing.

def topn(df, n):
    npa = df.to_numpy()   # as_matrix() was removed in pandas 1.0
    topn_ind = np.argpartition(npa, -n, None)[-n:]  # flattened (C-order) indices, unsorted
    topn_ind = topn_ind[np.argsort(npa.flat[topn_ind])][::-1]  # argsort in descending order
    rows, cols = np.unravel_index(topn_ind, npa.shape)  # unflatten in C order, matching npa.flat
    return ([df.columns[c] for c in cols], [df.index[r] for r in rows])

Trying this on the example:

>>>df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>>topn(df,3)
(['b', 'c', 'b'], ['e', 'f', 'f'])

As desired. Mind you, the sorting was not originally asked for, but it adds little overhead if n is not large.


1 Comment

this seems not to work for arbitrary DataFrames, topn(pd.DataFrame(np.random.randn(40,5)),4) gives me an IndexError: index out of bounds error

What you want to use is stack:

df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
df = df.stack()
df = df.sort_values(ascending=False)   # Series.sort() was removed; sort_values returns a copy
df.head(4)

e  b    9
f  c    8
   b    7
   a    6
dtype: int64
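On newer pandas the sort-then-head step can be collapsed into nlargest, which should give the same Series (a sketch of the same idea):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))

# nlargest on the stacked Series returns the top n values,
# keyed by their (row, column) MultiIndex labels
top = df.stack().nlargest(4)
print(top)
```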



I guess for what you are trying to do a DataFrame might not be the best choice, since the idea of the columns in the DataFrame is to hold independent data.

def topn(df, n):
    # pull the data out of the DataFrame
    # and flatten it to an array (column-major, to match the label arrays below)
    vals = df.values.flatten(order='F')
    # next we sort the array and store the sort mask
    p = np.argsort(vals)
    # create two arrays with the column names and indexes
    # in the same order as vals (one copy of the index per column)
    cols = np.array([[col]*len(df.index) for col in df.columns]).flatten()
    idxs = np.array([list(df.index) for _ in df.columns]).flatten()
    # sort, and return the top n of cols and idxs
    return cols[p][:-(n+1):-1], idxs[p][:-(n+1):-1]

>>> topn(df,3)
(array(['b', 'c', 'b'], 
      dtype='|S1'),
 array(['e', 'f', 'f'], 
      dtype='|S1'))
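As an aside, the two label arrays can also be built without Python-level loops, e.g. with np.repeat and np.tile (a sketch, matching the column-major flatten above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))

# a column-major (Fortran-order) flatten walks the frame column by column,
# so each column label repeats len(index) times and the row labels tile
cols = np.repeat(df.columns.values, len(df.index))
idxs = np.tile(df.index.values, len(df.columns))
print(cols)  # ['a' 'a' 'a' 'b' 'b' 'b' 'c' 'c' 'c']
print(idxs)  # ['d' 'e' 'f' 'd' 'e' 'f' 'd' 'e' 'f']
```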


>>> %timeit(topn(df,3))
10000 loops, best of 3: 29.9 µs per loop

watsonic's solution takes slightly less time:

%timeit(topn(df,3))
10000 loops, best of 3: 24.6 µs per loop

but way faster than stack

def topStack(df,n):
    df = df.stack()
    df.sort(ascending=False)
    return df.head(n)

%timeit(topStack(df,3))
1000 loops, best of 3: 1.91 ms per loop

3 Comments

sorry but this is wrong in several ways and doesn't address my question. i added some explicit examples to make it more clear. also fyi df.describe().loc['max'] is much easier done as df.max()
hey @greole, cool a completely different way! for the communities benefit can you explain a bit about your code? some lines are somewhat cryptic, e.g. *[[col]*len(df) for col in df.columns]) and MultiIndex.from_tuples(zip(cols,val.index)) maybe take it piece by piece or supplement with comments? also if you package this as a function, we can compare runtimes... thanks for the different way!
@watsonic ok I refactored it a bit and added some comments. Avoiding to operate on DataFrame it self gave a huge speed benefit.
