1

Assuming I have the following Pandas DataFrame:

     U     A         B
0  2000    10       20
1  3000    40        0 
2  2100    20       30
3  2500     0       30 
4  2600    30       40

How can I get the index of first row that both A and B have non-zero value and (A+B)/2 is larger than 15 ?

In this example, I would like to get 2 since it is the first row that have non-zero A and B column and avg value of 25 which is more than 15

Note that this DataFrame is huge, I am looking for the fastest way to the index value.

1
  • does this answer your question? Commented Dec 28, 2020 at 22:33

3 Answers 3

5

Lets try:

 df[(df.A.ne(0)&df.B.ne(0))&((df.A+df.B)/2).gt(15)].first_valid_index()
Sign up to request clarification or add additional context in comments.

2 Comments

What if there is NaN instead of 0 in one row ? Does it treat NaN similar to 0 ?
Yap, NaN is treated as 0
1

I find more readable explicit variables, like:

AB2 = (df['A']+df['B'])/2 
filter = (df['A'] != 0) & (df['B'] != 0) & (AB2>15)
your_index = df[filter].index[0]

Performance For this use case (ridiculous dataset)

%%timeit
df[(df.A.ne(0)&df.B.ne(0))&((df.A+df.B)/2).gt(15)].first_valid_index()
**1.21 ms** ± 35.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
AB2 = (df['A']+df['B'])/2 
filter = (df['A'].ne(0)) & (df['B'].ne(0)) & (AB2>15)
df[filter].index[0]
**1.08 ms** ± 28.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df.query("A!=0 and B!=0 and (A+B)/2 > 15").index[0]
**2.71 ms** ± 157 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Comments

0

If the dataframe is large, query might be faster:

df.query("A!=0 and B!=0 and (A+B)/2 > 15").index[0]

      2

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.