
I have a DatetimeIndex with the name idx:

DatetimeIndex(['2020-10-24 21:00:00+03:00', '2020-10-24 23:00:00+03:00',
           '2020-10-25 08:00:00+03:00', '2020-10-26 08:00:00+03:00',
           '2020-10-27 13:00:00+03:00', '2020-10-29 07:00:00+03:00',
           '2020-10-29 22:00:00+03:00', '2020-10-31 01:00:00+03:00',
           '2020-11-01 16:00:00+03:00', '2020-11-03 18:00:00+03:00',
           '2020-11-04 20:00:00+03:00', '2020-11-05 17:00:00+03:00'],
          dtype='datetime64[ns, Europe/Moscow]', freq=None)

I need to iterate through the dataframe rows to calculate the cumulative max of the 'close' column from each idx element to the next, then from the following one to the next, and so on. The following works well:

for i in np.arange(len(idx)):
    signals.loc[idx[i]:, 'close_max'] = signals.loc[idx[i]:, 'close'].cummax(axis=0)

But iterating over a dataframe is not a good thing. Could you help me do this without a for loop?


  • I don't quite understand what you want to do, but do you think it's possible to do it in parallel, or do you need to produce extra data for every row? Commented Nov 7, 2020 at 10:30
  • @Charalamm I need to find the max value of the 'close' column for each interval between the timestamps in idx Commented Nov 7, 2020 at 12:50
  • Still don't fully get it, sorry for that. Did you try .apply()? It applies the function inside the parentheses to the whole column Commented Nov 7, 2020 at 14:11

1 Answer


You can find the integer positions at which your idx values fall within df.index by using np.searchsorted (bonus: it works even if some values of idx are not found in df.index).

Once you have these integer indices, build a grp value suitable for grouping your df. Then groupby and apply cummax.

Putting it all together:

import numpy as np

ix = np.concatenate(([0], np.searchsorted(df.index, idx), [df.shape[0]]))
grp = np.repeat(ix[:-1], np.diff(ix))
df['close_max'] = df['close'].groupby(grp).cummax()
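To see how the group labels are formed, here is a small toy example (dates chosen purely for illustration): searchsorted gives the row position of each split point, and np.repeat stretches each segment's starting position across all rows of that segment.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): five daily rows and two split points.
df_index = pd.date_range('2020-01-01', periods=5, freq='D')
idx = pd.DatetimeIndex(['2020-01-02', '2020-01-04'])

pos = np.searchsorted(df_index, idx)               # positions of the split points
ix = np.concatenate(([0], pos, [len(df_index)]))   # segment boundaries: [0, 1, 3, 5]
grp = np.repeat(ix[:-1], np.diff(ix))              # one group label per row

print(pos)   # [1 3]
print(grp)   # [0 1 1 3 3]
```

Each row now carries the label of the segment it belongs to, so a groupby on grp restarts cummax at every split point.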

Validation:

First, let's build some data similar to yours, for testing:

import numpy as np
import pandas as pd

n = 1000
df = pd.DataFrame(
    420 + np.round(np.cumsum(np.random.normal(size=n)), 2),
    columns=['close'],
    index=pd.date_range('2020-10-24', periods=n, freq='h'))

idx = [
    pd.Timestamp('2020-10-24') + k * pd.Timedelta('1 hour')
    for k in np.cumsum(np.random.randint(1, 48, size=n))
]
idx = [t for t in idx if df.first_valid_index() <= t <= df.last_valid_index()]
idx = pd.DatetimeIndex(idx)

Then, your "signals" calculation, slightly modified so as to not have NaNs:

signals = df[['close']].copy()
signals['close_max'] = signals['close'].cummax()
for t in idx:
    signals.loc[t:, 'close_max'] = signals.loc[t:, 'close'].cummax()

# apply the three lines in the solution above to add 'close_max' to df
# and finally:

signals.equals(df)
# True

1 Comment

Thank you! It works just fine and is four times faster than the for-loop.
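The speedup can be checked with a rough timing sketch along the lines below (a self-contained comparison under the same synthetic-data assumptions as the answer's validation; absolute numbers will vary with data size and hardware):

```python
import time
import numpy as np
import pandas as pd

# Synthetic data, as in the answer's validation section.
n = 10_000
df = pd.DataFrame(
    420 + np.round(np.cumsum(np.random.normal(size=n)), 2),
    columns=['close'],
    index=pd.date_range('2020-10-24', periods=n, freq='h'))
positions = np.cumsum(np.random.randint(1, 48, size=n // 24))
idx = df.index[positions[positions < n]]

# Loop version (the question's approach).
t0 = time.perf_counter()
signals = df[['close']].copy()
signals['close_max'] = signals['close'].cummax()
for t in idx:
    signals.loc[t:, 'close_max'] = signals.loc[t:, 'close'].cummax()
loop_time = time.perf_counter() - t0

# Vectorized version (the answer's approach).
t0 = time.perf_counter()
ix = np.concatenate(([0], np.searchsorted(df.index, idx), [df.shape[0]]))
grp = np.repeat(ix[:-1], np.diff(ix))
df['close_max'] = df['close'].groupby(grp).cummax()
vec_time = time.perf_counter() - t0

print(f'loop: {loop_time:.3f}s  vectorized: {vec_time:.3f}s')
```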
