Assign value based on output of previous row

Question

I'm analyzing an app's output log with pandas and want to assign each entry to a session. A session is defined as a 60-minute period from the start.

Here's a small example:

import numpy as np
import pandas as pd
from datetime import timedelta

> df = pd.DataFrame({
    'time': [
        pd.Timestamp(2019, 1, 1, 1, 10),
        pd.Timestamp(2019, 1, 1, 1, 15),
        pd.Timestamp(2019, 1, 1, 1, 20),
        pd.Timestamp(2019, 1, 1, 2, 20),
        pd.Timestamp(2019, 1, 1, 5, 0),
        pd.Timestamp(2019, 1, 1, 5, 15)
    ]
})

> df
                   time
0   2019-01-01 01:10:00
1   2019-01-01 01:15:00
2   2019-01-01 01:20:00
3   2019-01-01 02:20:00
4   2019-01-01 05:00:00
5   2019-01-01 05:15:00

For the first row, start_time is equal to time. For each subsequent row, if its time is within 1 hour of the previous row's time, it is considered part of the same session. Otherwise it starts a new session with start_time = time. I'm using a loop:

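# Start with an empty datetime column rather than a float NaN column
df['start_time'] = pd.NaT

for index in df.index:
    if index == 0:
        start_time = df.at[index, 'time']
    else:
        # Gap between this row and the previous one
        delta = df.at[index, 'time'] - df.at[index - 1, 'time']
        start_time = df.at[index - 1, 'start_time'] if delta.total_seconds() <= 3600 else df.at[index, 'time']

    # Use .at instead of chained indexing so the write reaches df itself
    df.at[index, 'start_time'] = start_time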

Output:

                   time          start_time
0   2019-01-01 01:10:00 2019-01-01 01:10:00
1   2019-01-01 01:15:00 2019-01-01 01:10:00
2   2019-01-01 01:20:00 2019-01-01 01:10:00
3   2019-01-01 02:20:00 2019-01-01 01:10:00
4   2019-01-01 05:00:00 2019-01-01 05:00:00 # new session
5   2019-01-01 05:15:00 2019-01-01 05:00:00

It works but very slowly. Is there a way to vectorize it?

BENY · Accepted Answer · 2019-03-08 00:21:45Z

2

Use diff with cumsum to create the group key, then use that key to get the first value of each group:

s=(df.time.diff()/np.timedelta64(1, 's')).gt(3600).cumsum()
df.groupby(s)['time'].transform('first')
Out[833]: 
0   2019-01-01 01:10:00
1   2019-01-01 01:10:00
2   2019-01-01 01:10:00
3   2019-01-01 01:10:00
4   2019-01-01 05:00:00
5   2019-01-01 05:00:00
Name: time, dtype: datetime64[ns]
df['start_time']=df.groupby(s)['time'].transform('first')
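
To see why the group key works, it can help to look at the intermediate values. The following is a minimal sketch that rebuilds the frame from the question and splits the one-liner into steps. The names gap, new_session and key are illustrative only and do not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame({
    'time': pd.to_datetime([
        '2019-01-01 01:10', '2019-01-01 01:15', '2019-01-01 01:20',
        '2019-01-01 02:20', '2019-01-01 05:00', '2019-01-01 05:15',
    ])
})

# Gap to the previous row; the first row has no predecessor, so it is NaT.
gap = df['time'].diff()

# True where the gap exceeds one hour, i.e. where a new session starts.
# A comparison against NaT is False, so the first row opens no extra group.
new_session = gap > pd.Timedelta(hours=1)

# Running count of session starts: a label that stays constant within a session.
key = new_session.cumsum()

# Broadcast each session's first timestamp back to all of its rows.
df['start_time'] = df.groupby(key)['time'].transform('first')
```

Note that the row at 02:20 is exactly one hour after 01:20, so it stays in the first session, which matches the `<= 3600` test in the question's loop.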

answered Mar 8, 2019 at 0:21

BENY


rhedak · Accepted Answer · 2019-03-08 00:46:06Z

1

I used np.where, shift and cumsum to build a session id, then transform with min to get the start time:

# Compare the gap as a Timedelta; the first row's gap is NaT, which compares as False
df['session_id'] = np.where((df['time'] - df['time'].shift(1)) > pd.Timedelta(minutes=60), 1, 0).cumsum()
df['start_time'] = df.groupby(['session_id'])['time'].transform('min')

display(df)

                   time  session_id          start_time
0   2019-01-01 01:10:00           0 2019-01-01 01:10:00
1   2019-01-01 01:15:00           0 2019-01-01 01:10:00
2   2019-01-01 01:20:00           0 2019-01-01 01:10:00
3   2019-01-01 02:20:00           0 2019-01-01 01:10:00
4   2019-01-01 05:00:00           1 2019-01-01 05:00:00
5   2019-01-01 05:15:00           1 2019-01-01 05:00:00
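
One way to gain confidence in a vectorized version is to compare it against a plain row-by-row reference. The sketch below assumes the timestamps are already sorted in ascending order, in which case the minimum of each group is also its first value. The helper names assign_session_start and assign_session_start_loop and the max_gap parameter are hypothetical and not taken from either answer.

```python
import pandas as pd


def assign_session_start(times, max_gap=pd.Timedelta(hours=1)):
    """Vectorized: label sessions by cumulative count of large gaps."""
    key = (times.diff() > max_gap).cumsum()
    return times.groupby(key).transform('min')


def assign_session_start_loop(times, max_gap=pd.Timedelta(hours=1)):
    """Row-by-row reference that mirrors the loop in the question."""
    starts = []
    prev = None
    for t in times:
        if prev is None or t - prev > max_gap:
            starts.append(t)           # gap too large: open a new session
        else:
            starts.append(starts[-1])  # same session: reuse its start time
        prev = t
    return pd.Series(starts, index=times.index, name=times.name)


times = pd.Series(pd.to_datetime([
    '2019-01-01 01:10', '2019-01-01 01:15', '2019-01-01 01:20',
    '2019-01-01 02:20', '2019-01-01 05:00', '2019-01-01 05:15',
]), name='time')

vectorized = assign_session_start(times)
looped = assign_session_start_loop(times)

# Compare element-wise as Python Timestamps to avoid dtype-unit differences.
matches = vectorized.tolist() == looped.tolist()
```

If the log is not guaranteed to be sorted, sorting by time before calling either helper would be needed for the gap logic to make sense.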

edited Mar 8, 2019 at 0:46

answered Mar 8, 2019 at 0:23

rhedak


2 Comments

BENY Over a year ago

Take a look at your output... I think you're close, but it still doesn't answer the question.

rhedak Over a year ago

I changed my approach slightly... this should work. Sorry about the confusion. I guess now it is similar to your answer...
