
I have a DatetimeIndex with the name idx:

DatetimeIndex(['2020-10-24 21:00:00+03:00', '2020-10-24 23:00:00+03:00',
           '2020-10-25 08:00:00+03:00', '2020-10-26 08:00:00+03:00',
           '2020-10-27 13:00:00+03:00', '2020-10-29 07:00:00+03:00',
           '2020-10-29 22:00:00+03:00', '2020-10-31 01:00:00+03:00',
           '2020-11-01 16:00:00+03:00', '2020-11-03 18:00:00+03:00',
           '2020-11-04 20:00:00+03:00', '2020-11-05 17:00:00+03:00'],
          dtype='datetime64[ns, Europe/Moscow]', freq=None)

I need to iterate through the dataframe rows to calculate the cumulative max of the 'close' column from each idx element to the next, then from the following one to the next, and so on. The following works well:

for i in np.arange(len(idx)):
    signals.loc[idx[i]:, 'close_max'] = signals.loc[idx[i]:, 'close'].cummax(axis=0)

But iterating over a dataframe is not a good thing. Could you help me do this without a for loop?


  • I don't quite understand what you want to do, but do you think it's possible to do it in parallel, or do you need to produce extra data for every row? Commented Nov 7, 2020 at 10:30
  • @Charalamm I need to find the max value of the 'close' column for each interval between the timestamps in idx Commented Nov 7, 2020 at 12:50
  • Still don't fully get it, sorry for that. Did you try .apply()? It applies the function inside the parentheses to the whole column Commented Nov 7, 2020 at 14:11

1 Answer


You can find the integer positions at which your idx values fall within df.index by using np.searchsorted (bonus: it works even if some values of idx are not found in df.index).

Once you have these integer indices, build a grp value suitable for grouping your df. Then groupby and apply cummax.

Putting it all together:

import numpy as np

ix = np.concatenate(([0], np.searchsorted(df.index, idx), [df.shape[0]]))
grp = np.repeat(ix[:-1], np.diff(ix))
df['close_max'] = df['close'].groupby(grp).cummax()
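To see how the group labels are formed, here is a small toy example (dates chosen purely for illustration): searchsorted gives the row position of each split point, and np.repeat stretches each segment's starting position across all rows of that segment.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): five daily rows and two split points.
df_index = pd.date_range('2020-01-01', periods=5, freq='D')
idx = pd.DatetimeIndex(['2020-01-02', '2020-01-04'])

pos = np.searchsorted(df_index, idx)               # positions of the split points
ix = np.concatenate(([0], pos, [len(df_index)]))   # segment boundaries: [0, 1, 3, 5]
grp = np.repeat(ix[:-1], np.diff(ix))              # one group label per row

print(pos)   # [1 3]
print(grp)   # [0 1 1 3 3]
```

Each row now carries the label of the segment it belongs to, so a groupby on grp restarts cummax at every split point.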

Validation:

First, let's build some data similar to yours, for testing:

import numpy as np
import pandas as pd

n = 1000
df = pd.DataFrame(
    420 + np.round(np.cumsum(np.random.normal(size=n)), 2),
    columns=['close'],
    index=pd.date_range('2020-10-24', periods=n, freq='h'))

idx = [
    pd.Timestamp('2020-10-24') + k * pd.Timedelta('1 hour')
    for k in np.cumsum(np.random.randint(1, 48, size=n))
]
idx = [t for t in idx if df.first_valid_index() <= t <= df.last_valid_index()]
idx = pd.DatetimeIndex(idx)

Then, your "signals" calculation, slightly modified so as to not have NaNs:

signals = df[['close']].copy()
signals['close_max'] = signals['close'].cummax()
for t in idx:
    signals.loc[t:, 'close_max'] = signals.loc[t:, 'close'].cummax()

# apply the three lines in the solution above to add 'close_max' to df
# and finally:

signals.equals(df)
# True

1 Comment

Thank you! It works just fine and is four times faster than the for-loop.
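The speedup can be checked with a rough timing sketch along the lines below (a self-contained comparison under the same synthetic-data assumptions as the answer's validation; absolute numbers will vary with data size and hardware):

```python
import time
import numpy as np
import pandas as pd

# Synthetic data, as in the answer's validation section.
n = 10_000
df = pd.DataFrame(
    420 + np.round(np.cumsum(np.random.normal(size=n)), 2),
    columns=['close'],
    index=pd.date_range('2020-10-24', periods=n, freq='h'))
positions = np.cumsum(np.random.randint(1, 48, size=n // 24))
idx = df.index[positions[positions < n]]

# Loop version (the question's approach).
t0 = time.perf_counter()
signals = df[['close']].copy()
signals['close_max'] = signals['close'].cummax()
for t in idx:
    signals.loc[t:, 'close_max'] = signals.loc[t:, 'close'].cummax()
loop_time = time.perf_counter() - t0

# Vectorized version (the answer's approach).
t0 = time.perf_counter()
ix = np.concatenate(([0], np.searchsorted(df.index, idx), [df.shape[0]]))
grp = np.repeat(ix[:-1], np.diff(ix))
df['close_max'] = df['close'].groupby(grp).cummax()
vec_time = time.perf_counter() - t0

print(f'loop: {loop_time:.3f}s  vectorized: {vec_time:.3f}s')
```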
