I have this loop that iterates over a dataframe and creates a cumulative value. I have around 450k rows in my dataframe and it takes in excess of 30 minutes to complete.

Here is the head of my dataframe:

timestamp            open    high    low     close    volume   vol_thrs  flg
1970-01-01 09:30:59  136.01  136.08  135.94  136.030  5379100  0.0       0.0
1970-01-01 09:31:59  136.03  136.16  136.01  136.139  759900   0.0       0.0
1970-01-01 09:32:59  136.15  136.18  136.10  136.180  609000   0.0       0.0
1970-01-01 09:33:59  136.18  136.18  136.07  136.100  510900   0.0       0.0
1970-01-01 09:34:59  136.11  136.15  136.05  136.110  306400   0.0       0.0

The timestamp column is the index.

Any thoughts on how I can make this quicker?

for (i, (idx, row)) in enumerate(df.iterrows()):
    if i == 0:
        tmp_cum = df.loc[idx, 'volume']
    else:
        tmp_cum = tmp_cum + df.loc[idx, 'volume']

    if tmp_cum >= df.loc[idx, 'vol_thrs']:
        tmp_cum = 0
        df.loc[idx, 'flg'] = 1

5 Comments

  • Can you provide a few rows of data so we can understand what this calculation is doing? Commented May 10, 2018 at 15:11
  • I can't really without revealing too much about what I'm doing. :-/ Commented May 10, 2018 at 15:14
  • OK. I mean you don't have to share your real data, your data can be [1, 2, 3, 4, 5] for all intents, we just need something we can run. Commented May 10, 2018 at 15:16
  • I've edited the post with a sample dataframe. Of course the real one has over 450k rows. Commented May 10, 2018 at 15:24
  • It looks like you only care about the 'volume' column, and the ordering (since you are looping over rows in order of index). Can you instead just do your iteration over a numpy array of volumes? e.g. vol_array = df.volume.values, then do your iteration over vol_array? Should be orders of magnitude faster than repeatedly searching for and grabbing individual rows of the df. Commented May 10, 2018 at 19:52
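
A rough sketch of what that last suggestion could look like, in case it helps. The function name flag_volume_resets and the toy numbers are made up for illustration, and it assumes 'flg' starts out as all zeros (as in the sample above), since it overwrites the whole column:

```python
import numpy as np
import pandas as pd


def flag_volume_resets(volume, vol_thrs):
    """Return a 0/1 array marking rows where the running volume reaches the threshold."""
    flags = np.zeros(len(volume), dtype=np.float64)
    running = 0.0
    for i in range(len(volume)):
        running += volume[i]
        if running >= vol_thrs[i]:
            # Threshold reached: flag this row and restart the running total.
            running = 0.0
            flags[i] = 1.0
    return flags


# Toy data, not the asker's real values.
df = pd.DataFrame(
    {
        "volume": [5, 3, 4, 1, 7],
        "vol_thrs": [10.0, 10.0, 10.0, 10.0, 10.0],
    }
)

# Pull the columns out once as NumPy arrays, loop over those, write back once.
df["flg"] = flag_volume_resets(df["volume"].to_numpy(), df["vol_thrs"].to_numpy())
```

The loop logic is the same as in the question; the difference is that the per-row DataFrame lookups are replaced by plain array indexing and a single column assignment at the end.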

1 Answer

Try using df.at instead of df.loc, like so:

for (i, (idx, row)) in enumerate(df.iterrows()):
    if i == 0:
        tmp_cum = df.at[idx, 'volume']
    else:
        tmp_cum = tmp_cum + df.at[idx, 'volume']

    if tmp_cum >= df.at[idx, 'vol_thrs']:
        tmp_cum = 0
        df.at[idx, 'flg'] = 1

df.at should perform better here. It is meant for reading or writing a single value by row and column label, which is exactly what your loop does on every iteration. df.loc also supports slicing and other multi-value selections, which df.at does not.
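
To illustrate the difference, here is a small sketch on a toy frame (the three rows are copied from the head of the question's sample; the variable names are just for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"volume": [5379100, 759900, 609000], "flg": [0.0, 0.0, 0.0]},
    index=pd.to_datetime(
        ["1970-01-01 09:30:59", "1970-01-01 09:31:59", "1970-01-01 09:32:59"]
    ),
)
idx = df.index[1]

# For one row label and one column label, both accessors return the same scalar.
single_at = df.at[idx, "volume"]
single_loc = df.loc[idx, "volume"]

# .at can also set a single value.
df.at[idx, "flg"] = 1

# Only .loc accepts a label slice; here it returns a Series (both ends inclusive).
window = df.loc[df.index[0]:idx, "volume"]
```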

1 Comment

Wow! Down from 37 minutes to 13 seconds :) Thanks
