6

I have a DataFrame as below.

test = pd.DataFrame({'col1':[0,0,1,0,0,0,1,2,0], 'col2': [0,0,1,2,3,0,0,0,0]})
   col1  col2
0     0     0
1     0     0
2     1     1
3     0     2
4     0     3
5     0     0
6     1     0
7     2     0
8     0     0

For each column, I want to find the index of the value 1 that occurs before the maximum of that column. For example, for the first column the max is 2, and the index of the value 1 before the 2 is 6. For the second column the max is 3, and the index of the value 1 before the 3 is 2.

In summary, I am looking to get [6, 2] as the output for this test DataFrame. Is there a quick way to achieve this?
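For reference, here is one literal reading of the requirement as a plain loop (a sketch; `last_one_before_max` is my own helper name, and it returns `None` when no 1 precedes the max):

```python
import pandas as pd

test = pd.DataFrame({'col1': [0, 0, 1, 0, 0, 0, 1, 2, 0],
                     'col2': [0, 0, 1, 2, 3, 0, 0, 0, 0]})

def last_one_before_max(s):
    """Index of the last 1 strictly before the position of the column max."""
    stop = s.idxmax()  # position of the first maximum (label == position here)
    ones = [i for i in s.index[:stop] if s[i] == 1]
    return ones[-1] if ones else None  # None when no 1 precedes the max

result = [last_one_before_max(test[c]) for c in test.columns]
print(result)  # [6, 2]
```

The vectorized answers below all compute the same thing without the Python-level loop.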

1
  • Assume there is a col3 where 1 is at index 5 and everything else is 0. What do you want returned for col3 in this case? The majority of solutions here return either NaN or 0 for this case. Commented Jun 13, 2019 at 18:14

6 Answers

5

Use Series.mask to hide elements that aren't 1, then apply Series.last_valid_index to each column.

m = test.eq(test.max()).cumsum().gt(0) | test.ne(1) 
test.mask(m).apply(pd.Series.last_valid_index)

col1    6
col2    2
dtype: int64
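To see what the mask hides, the intermediate steps can be spelled out like this (the intermediate names `after_max` and `not_one` are mine):

```python
import pandas as pd

test = pd.DataFrame({'col1': [0, 0, 1, 0, 0, 0, 1, 2, 0],
                     'col2': [0, 0, 1, 2, 3, 0, 0, 0, 0]})

after_max = test.eq(test.max()).cumsum().gt(0)  # True at the max and everywhere after it
not_one = test.ne(1)                            # True wherever the value isn't 1
m = after_max | not_one                         # hide everything except 1s before the max

masked = test.mask(m)   # only the qualifying 1s survive as non-NaN
print(masked)
print(masked.apply(pd.Series.last_valid_index))
```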

Using numpy to vectorize, you can use numpy.cumsum and argmax:

idx = ((test.eq(1) & test.eq(test.max()).cumsum().eq(0))
            .values
            .cumsum(axis=0)
            .argmax(axis=0))
idx
# array([6, 2])

pd.Series(idx, index=[*test])

col1    6
col2    2
dtype: int64
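The argmax-of-cumsum trick works because the cumulative sum of a boolean column reaches its maximum at the last True, and `argmax` returns the first position attaining that maximum. A minimal sketch:

```python
import numpy as np

flags = np.array([False, False, True, False, True, False])
print(flags.cumsum())            # [0 0 1 1 2 2]
print(flags.cumsum().argmax())   # 4 -> index of the last True
```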

6 Comments

I think this isn't what the OP is asking for. They need to find the 1 before the maximal value.
@QuangHoang "the index of value 1 before 2 is 6. for the second column, the max is 3, the index of value 1 before the value 3 is 2." the output here is 6 and 2. What am I missing?
That is a coincidence, because there are no 1's after the value 3 in col2, which is the maximum of that column. If test.loc[5,'col2']=1, wouldn't your solution give 5 for col2?
As I commented on other solutions: assume there is a col3 where 1 is at index 5 and everything else is 0. Yours returns NaN (same as mine) for the 1st solution and 0 for the 2nd solution :)
@AndyL. The question is "find index of a value before the maximum for each column in python dataframe"; if 1 is the maximum, then the index should be invalid.
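The col3 edge case from the comments, run through the masking solution above (a sketch; here 1 itself is the column maximum, so no 1 precedes the max and the result is NaN):

```python
import pandas as pd

test3 = pd.DataFrame({'col3': [0, 0, 0, 0, 0, 1, 0, 0, 0]})

# same mask as above: hide everything at/after the max, and everything that isn't 1
m = test3.eq(test3.max()).cumsum().gt(0) | test3.ne(1)

res = test3.mask(m).apply(pd.Series.last_valid_index)
print(res)  # col3 -> NaN/None: no 1 strictly before the max
```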
4

Using @cs95 idea of last_valid_index:

test.apply(lambda x: x[:x.idxmax()].eq(1)[lambda i:i].last_valid_index())

Output:

col1    6
col2    2
dtype: int64

Explained:

Index slicing cuts each column from the start up to its max value; then we look for the values equal to one and take the index of the last True value.
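Illustrated on a single column (a sketch using `iloc` for an unambiguous positional slice; the intermediate names are mine):

```python
import pandas as pd

s = pd.Series([0, 0, 1, 0, 0, 0, 1, 2, 0], name='col1')

upto = s.iloc[:s.idxmax()]   # positions strictly before the first max
mask = upto.eq(1)            # True where the value is 1
last_idx = mask[mask].index[-1]
print(last_idx)  # 6
```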

Or as @QuangHoang suggests:

test.apply(lambda x: x[:x.idxmax()].eq(1).cumsum().idxmax()) 

5 Comments

Or similarly: test.apply(lambda x: x[:x.idxmax()].eq(1).cumsum().idxmax())
Surprisingly, cumsum is a tiny bit better.
Assume there is a col3 where 1 is at index 5 and everything else is 0. This returns NaN for the 1st solution and 0 for the 2nd solution. My solution also returns NaN.
Ah, yes. That is an edge case I need to modify this solution for.
@AAA Did these solutions help you? Would you consider accepting a solution?
4

Overkill with Numpy

t = test.to_numpy()
a = t.argmax(0)

i, j = np.where(t == 1)
mask = i <= a[j]
i = i[mask]
j = j[mask]

b = np.empty_like(a)
b.fill(-1)

np.maximum.at(b, j, i)

pd.Series(b, test.columns)

col1    6
col2    2
dtype: int64
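`np.maximum.at` does an unbuffered, in-place grouped maximum: for each pair of a column index and a row index it sets `b[col] = max(b[col], row)`, so after the call each column holds the largest qualifying row. A tiny sketch with hand-picked pairs:

```python
import numpy as np

b = np.array([-1, -1])                  # one slot per column, -1 = "not found"
cols = [0, 0, 1]                        # column 0 sees rows 2 and 6; column 1 sees row 2
rows = [2, 6, 2]
np.maximum.at(b, cols, rows)
print(b)  # [6 2]
```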

apply

test.apply(lambda s: max(s.index, key=lambda x: (s[x] == 1, s[x] <= s.max(), x)))

col1    6
col2    2
dtype: int64

cummax

test.eq(1).where(test.cummax().lt(test.max())).iloc[::-1].idxmax()

col1    6
col2    2
dtype: int64
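Reversing and calling `idxmax` exploits the fact that `idxmax` on booleans returns the label of the first True, so on a reversed Series it finds the last True instead. A minimal sketch:

```python
import pandas as pd

flags = pd.Series([False, False, True, False, False, False, True, False, False])
print(flags.idxmax())             # 2 -> first True
print(flags.iloc[::-1].idxmax())  # 6 -> last True
```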

Timing

I just wanted to use a new tool and do some benchmarking; see this post

Results

r.to_pandas_dataframe().T

         10        31        100       316       1000      3162      10000
al_0  0.003696  0.003718  0.005512  0.006210  0.010973  0.007764  0.012008
wb_0  0.003348  0.003334  0.003913  0.003935  0.004583  0.004757  0.006096
qh_0  0.002279  0.002265  0.002571  0.002643  0.002927  0.003070  0.003987
sb_0  0.002235  0.002246  0.003072  0.003357  0.004136  0.004083  0.005286
sb_1  0.001771  0.001779  0.002331  0.002353  0.002914  0.002936  0.003619
cs_0  0.005742  0.005751  0.006748  0.006808  0.007845  0.008088  0.009898
cs_1  0.004034  0.004045  0.004871  0.004898  0.005769  0.005997  0.007338
pr_0  0.002484  0.006142  0.027101  0.085944  0.374629  1.292556  6.220875
pr_1  0.003388  0.003414  0.003981  0.004027  0.004658  0.004929  0.006390
pr_2  0.000087  0.000088  0.000089  0.000093  0.000107  0.000145  0.000300

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 10))
ax = plt.subplot()
r.plot(ax=ax)

[benchmark plot: runtime vs. DataFrame size for each solution]

Setup

import numpy as np
import pandas as pd
from simple_benchmark import BenchmarkBuilder

b = BenchmarkBuilder()

def al_0(test): return test.apply(lambda x: x.where(x[:x.idxmax()].eq(1)).drop_duplicates(keep='last').idxmin())
def wb_0(df): return (df.iloc[::-1].cummax().eq(df.max())&df.eq(1).iloc[::-1]).idxmax()
def qh_0(test): return (test.eq(1) & (test.index.values[:,None] < test.idxmax().values)).cumsum().idxmax()
def sb_0(test): return test.apply(lambda x: x[:x.idxmax()].eq(1)[lambda i:i].last_valid_index())
def sb_1(test): return test.apply(lambda x: x[:x.idxmax()].eq(1).cumsum().idxmax())
def cs_0(test): return (lambda m: test.mask(m).apply(pd.Series.last_valid_index))(test.eq(test.max()).cumsum().gt(0) | test.ne(1))
def cs_1(test): return pd.Series((test.eq(1) & test.eq(test.max()).cumsum().eq(0)).values.cumsum(axis=0).argmax(axis=0), test.columns)
def pr_0(test): return test.apply(lambda s: max(s.index, key=lambda x: (s[x] == 1, s[x] <= s.max(), x)))
def pr_1(test): return test.eq(1).where(test.cummax().lt(test.max())).iloc[::-1].idxmax()
def pr_2(test):
    t = test.to_numpy()
    a = t.argmax(0)

    i, j = np.where(t == 1)
    mask = i <= a[j]
    i = i[mask]
    j = j[mask]

    b = np.empty_like(a)
    b.fill(-1)

    np.maximum.at(b, j, i)

    return pd.Series(b, test.columns)

import math

def gen_test(n):
    a = np.random.randint(100, size=(n, int(math.log10(n)) + 1))
    idx = a.argmax(0)
    while (idx == 0).any():
        a = np.random.randint(100, size=(n, int(math.log10(n)) + 1))
        idx = a.argmax(0)        

    for j, i in enumerate(idx):
        a[np.random.randint(i), j] = 1

    return pd.DataFrame(a).add_prefix('col')

@b.add_arguments('DataFrame Size')
def argument_provider():
    for exponent in np.linspace(1, 3, 5):
        size = int(10 ** exponent)
        yield size, gen_test(size)

b.add_functions([al_0, wb_0, qh_0, sb_0, sb_1, cs_0, cs_1, pr_0, pr_1, pr_2])

r = b.run()

Comments

3

A little bit of logic here:

(df.iloc[::-1].cummax().eq(df.max())&df.eq(1).iloc[::-1]).idxmax()
Out[187]: 
col1    6
col2    2
dtype: int64

2 Comments

@piRSquared Ah, I really need to try hard to find a different method. :-)
I thought about using the reversing trick as well.
2

Here's a mixed numpy and pandas solution:

(test.eq(1) & (test.index.values[:,None] < test.idxmax().values)).cumsum().idxmax()

which is a bit faster than the other solutions.
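The broadcast comparison builds a column vector of row positions and compares it against each column's `idxmax`, yielding a 2-D "strictly before the max" mask in one step (a sketch; the intermediate names are mine):

```python
import pandas as pd

test = pd.DataFrame({'col1': [0, 0, 1, 0, 0, 0, 1, 2, 0],
                     'col2': [0, 0, 1, 2, 3, 0, 0, 0, 0]})

# (9, 1) row positions < (2,) per-column max positions -> (9, 2) boolean mask
before_max = test.index.values[:, None] < test.idxmax().values

hits = test.eq(1) & before_max        # 1s strictly before each column's max
print(hits.cumsum().idxmax())         # last hit per column, via the cumsum trick
```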

1 Comment

As I commented on other solutions: for a col3 where only one row is 1 and everything else is 0, yours returns 0 for col3. We don't know what the OP wants in this case.
1

I would use where to keep only the 1s before the max, drop_duplicates with keep='last' to keep the last 1, and call idxmin on the result.

test.apply(lambda x: x.where(x[:x.idxmax()].eq(1)).drop_duplicates(keep='last').idxmin())

Out[1433]:
col1    6
col2    2
dtype: int64
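Spelled out on a single column (a sketch; I reindex the boolean condition explicitly so `where` gets a mask the same length as the Series):

```python
import pandas as pd

s = pd.Series([0, 0, 1, 0, 0, 0, 1, 2, 0])

# True at the 1s strictly before the position of the max
cond = s.iloc[:s.idxmax()].eq(1).reindex(s.index, fill_value=False)

kept = s.where(cond)                         # 1.0 where cond holds, NaN elsewhere
deduped = kept.drop_duplicates(keep='last')  # collapses the NaNs, keeps the last 1
print(deduped.idxmin())  # 6
```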

Comments

