I have a pandas DataFrame that I'd like to filter based on whether certain conditions are met. I ran a loop and an .apply() and used %%timeit to compare their speed. The dataset has around 45000 rows. The code snippet for the loop is:
%%timeit
qualified_actions = []
for row in all_actions.index:
    if all_actions.ix[row, 'Lower'] <= all_actions.ix[row, 'Mid'] <= all_actions.ix[row, 'Upper']:
        qualified_actions.append(True)
    else:
        qualified_actions.append(False)
1.44 s ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
And for .apply() it is:
%%timeit
qualified_actions = all_actions.apply(lambda row: row['Lower'] <= row['Mid'] <= row['Upper'], axis=1)
6.71 s ± 54.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I thought .apply() is supposed to be much faster than looping through rows in pandas. Can someone explain why it's slower in this case?
apply must construct a dict for every row. Meanwhile, your for method efficiently accesses the data using ix, without constructing any new objects.
I believe this only happens when you apply a Python function; when applying numpy functions you stay in C-land and things go fast.
.apply is not supposed to be faster when iterating over rows. An .apply is essentially a for-loop under the hood if you go along axis=1. See here.
It constructs a Series out of each row, actually, but yeah, same effect.
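
To make the points above concrete, here is a minimal sketch. It assumes a randomly generated stand-in for all_actions (the real data is not shown in the question) and uses plain column-wise comparisons, which is one way to stay in "C-land"; it is not taken from the original answers. It also checks what type of object apply hands to the function when axis=1.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's all_actions frame (about 45000 rows).
rng = np.random.default_rng(0)
n = 45_000
all_actions = pd.DataFrame({
    "Lower": rng.random(n),
    "Mid": rng.random(n),
    "Upper": rng.random(n),
})

# With axis=1, apply passes one Series per row to the Python function,
# which is where the per-row overhead comes from.
row_types = all_actions.head(3).apply(lambda row: type(row).__name__, axis=1)

# Vectorized alternative: compare whole columns at once instead of row by row.
qualified_actions = (
    (all_actions["Lower"] <= all_actions["Mid"])
    & (all_actions["Mid"] <= all_actions["Upper"])
)

# Sanity check on a small slice: the row-wise apply gives the same answer.
sample = all_actions.head(200)
via_apply = sample.apply(
    lambda row: row["Lower"] <= row["Mid"] <= row["Upper"], axis=1
)
```

The boolean result can then be used directly as a mask, e.g. all_actions[qualified_actions]. Timings will depend on the pandas version and the data, so it is worth re-running %%timeit on the real frame rather than relying on this sketch.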