How can I loop through this pandas dataframe faster?

Question

I have this loop that iterates over a dataframe and creates a cumulative value. I have around 450k rows in my dataframe and it takes in excess of 30 minutes to complete.

Here is the head of my dataframe:

timestamp            open    high    low     close    volume   vol_thrs  flg
1970-01-01 09:30:59  136.01  136.08  135.94  136.030  5379100  0.0       0.0
1970-01-01 09:31:59  136.03  136.16  136.01  136.139  759900   0.0       0.0
1970-01-01 09:32:59  136.15  136.18  136.10  136.180  609000   0.0       0.0
1970-01-01 09:33:59  136.18  136.18  136.07  136.100  510900   0.0       0.0
1970-01-01 09:34:59  136.11  136.15  136.05  136.110  306400   0.0       0.0

The timestamp column is the index.

Any thoughts on how I can make this quicker?

for (i, (idx, row)) in enumerate(df.iterrows()):
    if i == 0:
        tmp_cum = df.loc[idx, 'volume']
    else:
        tmp_cum = tmp_cum + df.loc[idx, 'volume']

    if tmp_cum >= df.loc[idx, 'vol_thrs']:
        tmp_cum = 0
        df.loc[idx, 'flg'] = 1

Can you provide a few rows of data so we can understand what this calculation is doing?
– jpp, Commented May 10, 2018 at 15:11
I can't really without revealing too much about what I'm doing. :-/
– Shyrka, Commented May 10, 2018 at 15:14
OK. I mean you don't have to share your real data, your data can be [1, 2, 3, 4, 5] for all intents, we just need something we can run.
– jpp, Commented May 10, 2018 at 15:16
I've edited the post with a sample dataframe. Of course the real one has over 450k rows.
– Shyrka, Commented May 10, 2018 at 15:24
It looks like you only care about the 'volume' column, and the ordering (since you are looping over rows in order of index). Can you instead just do your iteration over a numpy array of volumes? e.g. vol_array = df.volume.values, then do your iteration over vol_array? Should be orders of magnitude faster than repeatedly searching for and grabbing individual rows of the df.
– n3utrino, Commented May 10, 2018 at 19:52
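
A minimal sketch of the array-based approach suggested in the last comment, assuming the same reset logic as the loop in the question. The function name flag_volume_resets and the sample values (including the threshold of 7) are hypothetical and chosen only for illustration; vol_thrs is pulled out as an array too, because the loop compares against it on every row.

```python
import pandas as pd


def flag_volume_resets(df):
    """Return a copy of df with 'flg' set to 1 wherever the running
    volume total reaches that row's 'vol_thrs' (the total then resets)."""
    # Pull the columns out once so the loop touches plain arrays,
    # not the DataFrame indexers.
    volume = df["volume"].to_numpy()
    threshold = df["vol_thrs"].to_numpy()
    flags = df["flg"].to_numpy().copy()

    running = 0
    for i in range(len(volume)):
        running += volume[i]
        if running >= threshold[i]:
            running = 0
            flags[i] = 1

    out = df.copy()
    out["flg"] = flags
    return out


# Hypothetical sample data, not taken from the question.
sample = pd.DataFrame(
    {
        "volume": [5, 3, 4, 1, 6],
        "vol_thrs": [7, 7, 7, 7, 7],
        "flg": [0.0, 0.0, 0.0, 0.0, 0.0],
    }
)
result = flag_volume_resets(sample)
```

The loop itself is still plain Python, so this is not fully vectorized; the saving comes from avoiding a label lookup on the DataFrame for every row.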

LetEpsilonBeLessThanZero · Accepted Answer · 2018-05-10 17:42:35Z

1

Try using df.at instead of df.loc, like so:

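for (i, (idx, row)) in enumerate(df.iterrows()):
    # Start the running total on the first row, then keep adding to it.
    if i == 0:
        tmp_cum = df.at[idx, 'volume']
    else:
        tmp_cum = tmp_cum + df.at[idx, 'volume']

    # Once the total reaches the threshold, flag the row and reset.
    if tmp_cum >= df.at[idx, 'vol_thrs']:
        tmp_cum = 0
        df.at[idx, 'flg'] = 1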

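df.at should perform better here. It is meant for reading or writing a single value by row and column label, which is exactly what your loop does on every iteration, so it skips much of the general-purpose work df.loc does on each call. The trade-off is that df.loc will let you do slicing, but df.at won't.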
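
To make the difference concrete, here is a small sketch on a three-row frame built from the volumes in the question's sample. The variable names are made up for illustration, and the frame is cut down to two columns.

```python
import pandas as pd

df = pd.DataFrame(
    {"volume": [5379100, 759900, 609000], "flg": [0.0, 0.0, 0.0]},
    index=pd.to_datetime(
        ["1970-01-01 09:30:59", "1970-01-01 09:31:59", "1970-01-01 09:32:59"]
    ),
)
idx = df.index[1]

# Both accessors return the same scalar for a single (row, column) pair.
via_loc = df.loc[idx, "volume"]
via_at = df.at[idx, "volume"]

# Scalar assignment works through .at as well.
df.at[idx, "flg"] = 1.0

# A label slice is something only .loc handles; .at is for one cell at a time.
first_two = df.loc[df.index[0]:df.index[1], "volume"]
```

For single-cell reads and writes the two are interchangeable in result, so swapping .loc for .at in the loop changes only the speed, not the output.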

answered May 10, 2018 at 17:42


1 Comment

Shyrka Over a year ago

Wow! Down from 37 minutes to 13 seconds :) Thanks
