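I have a loop that iterates over a DataFrame and keeps a running (cumulative) total of the volume column. Whenever the total reaches the row's threshold, the row is flagged and the total is reset to zero. The DataFrame has around 450k rows, and the loop takes more than 30 minutes to complete.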
Here is the head of my dataframe:
timestamp              open    high     low    close   volume  vol_thrs  flg
1970-01-01 09:30:59  136.01  136.08  135.94  136.030  5379100       0.0  0.0
1970-01-01 09:31:59  136.03  136.16  136.01  136.139   759900       0.0  0.0
1970-01-01 09:32:59  136.15  136.18  136.10  136.180   609000       0.0  0.0
1970-01-01 09:33:59  136.18  136.18  136.07  136.100   510900       0.0  0.0
1970-01-01 09:34:59  136.11  136.15  136.05  136.110   306400       0.0  0.0
The timestamp column is the index.
Any thoughts on how I can make this quicker? Here is the loop:
for (i, (idx, row)) in enumerate(df.iterrows()):
    if i == 0:
        tmp_cum = df.loc[idx, 'volume']
    else:
        tmp_cum = tmp_cum + df.loc[idx, 'volume']
    if tmp_cum >= df.loc[idx, 'vol_thrs']:
        tmp_cum = 0
        df.loc[idx, 'flg'] = 1
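For anyone who wants something runnable: below is a minimal sketch of the same logic applied to plain NumPy arrays instead of per-row df.loc lookups, which are likely where most of the time goes. The function name flag_volume_resets and the sample values are made up for illustration. The sketch has not been timed on the full 450k rows. It also assumes flg starts at 0 everywhere, because it overwrites the whole column.

```python
import numpy as np
import pandas as pd


def flag_volume_resets(volume, vol_thrs):
    """Return a 0/1 array marking rows where the running volume total
    reaches that row's threshold; the total resets to zero after each hit.

    Hypothetical helper, not part of pandas or NumPy.
    """
    flg = np.zeros(len(volume), dtype=np.float64)
    tmp_cum = 0.0
    for i in range(len(volume)):
        tmp_cum += volume[i]
        if tmp_cum >= vol_thrs[i]:
            tmp_cum = 0.0
            flg[i] = 1.0
    return flg


# Made-up sample data with the same column names as the question.
df = pd.DataFrame(
    {
        "volume": [100, 200, 300, 400, 500],
        "vol_thrs": [250.0, 250.0, 250.0, 1000.0, 800.0],
        "flg": 0.0,
    },
    index=pd.date_range("1970-01-01 09:30:59", periods=5, freq="min"),
)

# Pull the columns out once as arrays, loop over those, assign back once.
# Assumption: 'flg' is all zeros beforehand, since this replaces the column.
df["flg"] = flag_volume_resets(df["volume"].to_numpy(), df["vol_thrs"].to_numpy())
```

The reset makes each row depend on the previous one, so a plain cumsum does not reproduce it. The loop is kept, but it now runs over arrays rather than going through the DataFrame indexer on every row.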