
I have a pandas dataframe which I'd like to filter based on whether certain conditions are met. I ran a for loop and an .apply(), and used %%timeit to test for speed. The dataset has around 45000 rows. The code snippet for the loop is:

%%timeit
qualified_actions = []
for row in all_actions.index:
    if all_actions.ix[row,'Lower'] <= all_actions.ix[row, 'Mid'] <= all_actions.ix[row,'Upper']:
        qualified_actions.append(True)
    else:
        qualified_actions.append(False)

1.44 s ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And for .apply() is:

%%timeit
qualified_actions = all_actions.apply(lambda row: row['Lower'] <= row['Mid'] <= row['Upper'], axis=1)

6.71 s ± 54.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I thought .apply() is supposed to be much faster than looping through rows in pandas. Can someone explain why it's slower in this case?

  • I don't know for sure, but I guess it's due to the fact that apply must construct a dict for every row, while your for method efficiently accesses the data using ix without constructing any new objects. I believe this only happens when you apply a Python function; when applying numpy functions you stay in C-land and things go fast. Commented Dec 11, 2017 at 8:43
  • .apply is not supposed to be faster when iterating over rows. An .apply is essentially a for loop under the hood when you use axis=1. See here Commented Dec 11, 2017 at 8:55
  • @Amadan it constructs a Series out of each row, actually, but yeah, same effect. Commented Dec 11, 2017 at 9:01
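The per-row Series construction the comments mention is easy to see directly; a tiny sketch (with a made-up toy frame) that asks apply what object each row arrives as:

```python
import pandas as pd

# toy frame, just for illustration
df = pd.DataFrame({'Lower': [1, 2], 'Mid': [2, 1], 'Upper': [3, 3]})

# apply(axis=1) hands each row to the function as a pandas Series,
# so a new Series object is built for every single row -- that is the overhead
types = df.apply(lambda row: type(row).__name__, axis=1)
print(types.tolist())  # ['Series', 'Series']
```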

1 Answer


apply uses loops under the hood, so if you need better performance, the best and fastest methods are vectorized alternatives.

No loops, just two chained conditions in a vectorized solution:

m1 = all_actions['Lower'] <= all_actions['Mid']
m2 = all_actions['Mid'] <= all_actions['Upper']
qualified_actions = m1 & m2
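Since the question is about filtering the frame, the resulting boolean mask can be used directly for boolean indexing; a small self-contained sketch (the toy data here is made up for illustration):

```python
import pandas as pd

# toy data standing in for the real all_actions frame
all_actions = pd.DataFrame({'Lower': [1, 5, 0],
                            'Mid':   [3, 4, 2],
                            'Upper': [4, 6, 1]})

m1 = all_actions['Lower'] <= all_actions['Mid']
m2 = all_actions['Mid'] <= all_actions['Upper']
qualified_actions = m1 & m2

# boolean indexing keeps only the rows where the mask is True
filtered = all_actions[qualified_actions]
print(qualified_actions.tolist())  # [True, False, False]
```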

Thanks to Jon Clements for another solution:

all_actions.Mid.between(all_actions.Lower, all_actions.Upper)
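Note that Series.between is inclusive on both ends by default, which matches the <= comparisons in the question; a minimal sketch with made-up data:

```python
import pandas as pd

all_actions = pd.DataFrame({'Lower': [1, 5], 'Mid': [3, 5], 'Upper': [4, 6]})

# inclusive bounds by default: Mid == Lower still qualifies
mask = all_actions.Mid.between(all_actions.Lower, all_actions.Upper)
print(mask.tolist())  # [True, True]
```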

Timings:

import numpy as np
import pandas as pd

np.random.seed(2017)
N = 45000
all_actions = pd.DataFrame(np.random.randint(50, size=(N, 3)),
                           columns=['Lower', 'Mid', 'Upper'])

#print (all_actions)

In [85]: %%timeit
    ...: qualified_actions = []
    ...: for row in all_actions.index:
    ...:     if all_actions.ix[row,'Lower'] <= all_actions.ix[row, 'Mid'] <= all_actions.ix[row,'Upper']:
    ...:         qualified_actions.append(True)
    ...:     else:
    ...:         qualified_actions.append(False)
    ...: 
    ...: 
__main__:259: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
1 loop, best of 3: 579 ms per loop

In [86]: %%timeit
    ...: (all_actions.apply(lambda row: row['Lower'] <= row['Mid'] <= row['Upper'], axis=1))
    ...: 
1 loop, best of 3: 1.17 s per loop

In [87]: %%timeit
    ...: ((all_actions['Lower'] <= all_actions['Mid']) & (all_actions['Mid'] <= all_actions['Upper']))
    ...: 
1000 loops, best of 3: 509 µs per loop


In [90]: %%timeit
    ...: (all_actions.Mid.between(all_actions.Lower, all_actions.Upper))
    ...: 
1000 loops, best of 3: 520 µs per loop
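If you want to squeeze out a bit more, dropping to the underlying NumPy arrays skips the Series index-alignment machinery; a sketch under the same setup as the timings above (not benchmarked here, so no speed claim beyond "it avoids Series overhead"):

```python
import numpy as np
import pandas as pd

np.random.seed(2017)
N = 45000
all_actions = pd.DataFrame(np.random.randint(50, size=(N, 3)),
                           columns=['Lower', 'Mid', 'Upper'])

# compare raw ndarray columns instead of Series
vals = all_actions.values
mask = (vals[:, 0] <= vals[:, 1]) & (vals[:, 1] <= vals[:, 2])
```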

2 Comments

Or just all_actions.Mid.between(all_actions.Lower, all_actions.Upper)
@JonClements - Thank you, I added it to the answer. On the sample data it is a bit slower, but the difference is only about 11 µs, so it is approximately the same.
