
I have a dataframe in which all values are of the same variety (e.g. a correlation matrix -- but where we expect a unique maximum). I'd like to return the row and the column of the maximum of this matrix.

I can get the max across rows or columns by changing the first argument of

df.idxmax()

however I haven't found a suitable way to return the row/column index of the max of the whole dataframe.

For example, I can do this in numpy:

>>>npa = np.array([[1,2,3],[4,9,5],[6,7,8]])
>>>np.where(npa == np.amax(npa))
(array([1]), array([1]))

But when I try something similar in pandas:

>>>df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>>df.where(df == df.max().max())
    a   b   c
d NaN NaN NaN
e NaN   9 NaN
f NaN NaN NaN
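For reference, one route that does return both labels is to stack the frame into a Series with a (row, column) MultiIndex, where idxmax gives the pair directly (a sketch):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))

# stack() turns the frame into a Series keyed by (row, column) pairs,
# so idxmax() returns both labels at once
row, col = df.stack().idxmax()
print(row, col)  # e b
```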

At a second level, what I actually want to do is to return the rows and columns of the top n values, e.g. as a Series.

E.g. for the above I'd like a function which does:

>>>topn(df,3)
b e
c f
b f
dtype: object
>>>type(topn(df,3))
pandas.core.series.Series

or even just

>>>topn(df,3)
(['b','c','b'],['e','f','f'])

a la numpy.where()

  • Should topn(df,3) return (['b','c','a'],['e','f','f']) or (['b','c','b'],['e','f','f']) Commented Nov 12, 2014 at 9:07
  • the latter. df['a']['f'] = 6 while df['b']['f'] = 7 Commented Nov 12, 2014 at 9:35

3 Answers


I figured out the first part:

npa = df.to_numpy()   # as_matrix() was removed in pandas 1.0; .values also works
indx, cols = np.where(npa == np.amax(npa))   # np.where returns (rows, cols) in that order
([df.columns[c] for c in cols], [df.index[i] for i in indx])

Now I need a way to get the top n. One naive idea is to copy the array and iteratively replace the top values with NaN, grabbing indices as you go. That seems inefficient. Is there a better way to get the top n values of a numpy array? Fortunately, as shown here, there is: argpartition, though we have to use flattened indexing.

def topn(df, n):
    npa = df.to_numpy()   # as_matrix() was removed in pandas 1.0
    topn_ind = np.argpartition(npa, -n, None)[-n:]  # flattened (C-order) indices, unsorted
    topn_ind = topn_ind[np.argsort(npa.flat[topn_ind])][::-1]  # argsort in descending order
    rows, cols = np.unravel_index(topn_ind, npa.shape)  # unflatten in C order, matching npa.flat
    return ([df.columns[c] for c in cols], [df.index[r] for r in rows])

Trying this on the example:

>>>df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>>topn(df,3)
(['b', 'c', 'b'], ['e', 'f', 'f'])

As desired. Mind you, the sorting was not originally asked for, but it adds little overhead if n is not large.


1 Comment

this seems not to work for arbitrary DataFrames, topn(pd.DataFrame(np.random.randn(40,5)),4) gives me an IndexError: index out of bounds error

What you want to use is stack:

df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
df = df.stack()
df = df.sort_values(ascending=False)   # Series.sort() was removed; sort_values returns a copy
df.head(4)

e  b    9
f  c    8
   b    7
   a    6
dtype: int64
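On newer pandas the sort-then-head step can be collapsed into nlargest, which should give the same Series (a sketch of the same idea):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))

# nlargest on the stacked Series returns the top n values,
# keyed by their (row, column) MultiIndex labels
top = df.stack().nlargest(4)
print(top)
```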



I guess for what you are trying to do a DataFrame might not be the best choice, since the idea of the columns in the DataFrame is to hold independent data.

def topn(df, n):
    # pull the data out of the DataFrame
    # and flatten it to an array (column-major, to match the label arrays below)
    vals = df.values.flatten(order='F')
    # next we sort the array and store the sort mask
    p = np.argsort(vals)
    # create two arrays with the column names and indexes
    # in the same order as vals (one copy of the index per column)
    cols = np.array([[col]*len(df.index) for col in df.columns]).flatten()
    idxs = np.array([list(df.index) for _ in df.columns]).flatten()
    # sort, and return the top n of cols and idxs
    return cols[p][:-(n+1):-1], idxs[p][:-(n+1):-1]

>>> topn(df,3)
(array(['b', 'c', 'b'], 
      dtype='|S1'),
 array(['e', 'f', 'f'], 
      dtype='|S1'))
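As an aside, the two label arrays can also be built without Python-level loops, e.g. with np.repeat and np.tile (a sketch, matching the column-major flatten above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))

# a column-major (Fortran-order) flatten walks the frame column by column,
# so each column label repeats len(index) times and the row labels tile
cols = np.repeat(df.columns.values, len(df.index))
idxs = np.tile(df.index.values, len(df.columns))
print(cols)  # ['a' 'a' 'a' 'b' 'b' 'b' 'c' 'c' 'c']
print(idxs)  # ['d' 'e' 'f' 'd' 'e' 'f' 'd' 'e' 'f']
```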


>>> %timeit(topn(df,3))
10000 loops, best of 3: 29.9 µs per loop

watsonic's solution takes slightly less time:

%timeit(topn(df,3))
10000 loops, best of 3: 24.6 µs per loop

but way faster than stack

def topStack(df,n):
    df = df.stack()
    df.sort(ascending=False)
    return df.head(n)

%timeit(topStack(df,3))
1000 loops, best of 3: 1.91 ms per loop

3 Comments

sorry but this is wrong in several ways and doesn't address my question. i added some explicit examples to make it more clear. also fyi df.describe().loc['max'] is much easier done as df.max()
hey @greole, cool a completely different way! for the communities benefit can you explain a bit about your code? some lines are somewhat cryptic, e.g. *[[col]*len(df) for col in df.columns]) and MultiIndex.from_tuples(zip(cols,val.index)) maybe take it piece by piece or supplement with comments? also if you package this as a function, we can compare runtimes... thanks for the different way!
@watsonic ok I refactored it a bit and added some comments. Avoiding to operate on DataFrame it self gave a huge speed benefit.
