I'm analyzing an app's output log with pandas and want to assign each entry to a session. A session is defined as a 60-minute period from the start.

Here's a small example:

import numpy as np
import pandas as pd
from datetime import timedelta

> df = pd.DataFrame({
    'time': [
        pd.Timestamp(2019, 1, 1, 1, 10),
        pd.Timestamp(2019, 1, 1, 1, 15),
        pd.Timestamp(2019, 1, 1, 1, 20),
        pd.Timestamp(2019, 1, 1, 2, 20),
        pd.Timestamp(2019, 1, 1, 5, 0),
        pd.Timestamp(2019, 1, 1, 5, 15)
    ]
})

> df
                   time
0   2019-01-01 01:10:00
1   2019-01-01 01:15:00
2   2019-01-01 01:20:00
3   2019-01-01 02:20:00
4   2019-01-01 05:00:00
5   2019-01-01 05:15:00

For the first row, start_time is equal to time. For each subsequent row, if its time is within 1 hour of the previous row's time, it is considered part of the same session. If not, it starts a new session with start_time = time. I'm using a loop:

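df['start_time'] = pd.NaT  # datetime placeholder instead of float NaN

for index in df.index:
    if index == 0:
        start_time = df.loc[index, 'time']
    else:
        delta = df.loc[index, 'time'] - df.loc[index - 1, 'time']
        # Keep the previous row's session if the gap is at most one hour
        start_time = df.loc[index - 1, 'start_time'] if delta.total_seconds() <= 3600 else df.loc[index, 'time']

    # .loc avoids chained assignment, which can end up writing to a copy
    df.loc[index, 'start_time'] = start_time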

Output:

                   time          start_time
0   2019-01-01 01:10:00 2019-01-01 01:10:00
1   2019-01-01 01:15:00 2019-01-01 01:10:00
2   2019-01-01 01:20:00 2019-01-01 01:10:00
3   2019-01-01 02:20:00 2019-01-01 01:10:00
4   2019-01-01 05:00:00 2019-01-01 05:00:00 # new session
5   2019-01-01 05:15:00 2019-01-01 05:00:00

It works, but it is very slow. Is there a way to vectorize it?

2 Answers

Use diff with cumsum to create a group key, then use that key to get the first value of each group:

s=(df.time.diff()/np.timedelta64(1, 's')).gt(3600).cumsum()
df.groupby(s)['time'].transform('first')
Out[833]: 
0   2019-01-01 01:10:00
1   2019-01-01 01:10:00
2   2019-01-01 01:10:00
3   2019-01-01 01:10:00
4   2019-01-01 05:00:00
5   2019-01-01 05:00:00
Name: time, dtype: datetime64[ns]
df['start_time']=df.groupby(s)['time'].transform('first')
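
To see why the key works, here is a minimal sketch that builds it one step at a time on the question's sample data. The intermediate names gap, new_session and session_id are mine, not from the answer, and comparing against pd.Timedelta(hours=1) is assumed to be equivalent to dividing by np.timedelta64(1, 's') and testing against 3600.

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime([
        "2019-01-01 01:10", "2019-01-01 01:15", "2019-01-01 01:20",
        "2019-01-01 02:20", "2019-01-01 05:00", "2019-01-01 05:15",
    ])
})

# Gap to the previous row; the first row has no predecessor, so its gap is NaT.
gap = df["time"].diff()

# True where a row opens a new session. A NaT gap compares as False,
# so the first row is not flagged and simply lands in group 0.
new_session = gap > pd.Timedelta(hours=1)

# Running count of session starts, used as the group key.
session_id = new_session.cumsum()

# Broadcast each group's first timestamp back onto all of its rows.
df["start_time"] = df.groupby(session_id)["time"].transform("first")
print(df)
```

Note that an exact 60-minute gap (row 3) stays in the same session because the comparison is strict, which matches the <= 3600 test in the question's loop.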

I used np.where, shift and cumsum to make a session id. Then I used transform with min to get the start time of each session:

# A gap of more than 60 minutes marks a new session; the first row's NaT gap compares as False
df['session_id'] = np.where((df['time'] - df['time'].shift(1)) > pd.Timedelta(minutes=60), 1, 0).cumsum()
df['start_time'] = df.groupby(['session_id'])['time'].transform('min')

display(df)

                 time  session_id          start_time
0 2019-01-01 01:10:00           0 2019-01-01 01:10:00
1 2019-01-01 01:15:00           0 2019-01-01 01:10:00
2 2019-01-01 01:20:00           0 2019-01-01 01:10:00
3 2019-01-01 02:20:00           0 2019-01-01 01:10:00
4 2019-01-01 05:00:00           1 2019-01-01 05:00:00
5 2019-01-01 05:15:00           1 2019-01-01 05:00:00
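
A quick way to confirm that a vectorized version reproduces the loop is to run both on the same data and compare. This is only a sketch: the helper names start_times_loop and start_times_vectorized are hypothetical, and it assumes the time column is already sorted in ascending order (otherwise min and the row-by-row rule can disagree).

```python
import pandas as pd

ONE_HOUR = pd.Timedelta(hours=1)

def start_times_loop(times: pd.Series) -> pd.Series:
    """Reference implementation: the row-by-row rule from the question."""
    out = []
    for i, t in enumerate(times):
        if i == 0 or t - times.iloc[i - 1] > ONE_HOUR:
            out.append(t)        # this row opens a new session
        else:
            out.append(out[-1])  # inherit the current session's start
    return pd.Series(out, index=times.index, name="start_time")

def start_times_vectorized(times: pd.Series) -> pd.Series:
    """Session id from diff + cumsum, then the per-session minimum."""
    session_id = times.diff().gt(ONE_HOUR).cumsum()
    return times.groupby(session_id).transform("min").rename("start_time")

times = pd.Series(pd.to_datetime([
    "2019-01-01 01:10", "2019-01-01 01:15", "2019-01-01 01:20",
    "2019-01-01 02:20", "2019-01-01 05:00", "2019-01-01 05:15",
]), name="time")

loop_result = start_times_loop(times)
vec_result = start_times_vectorized(times)

# Element-wise comparison; both results share the same index.
matches = bool((loop_result == vec_result).all())
print("results match:", matches)
```

Running the same comparison on a larger slice of the real log is a cheap check before dropping the loop.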

2 Comments

Take a look at your output... I think you are close, but it still doesn't answer the question.
I changed my approach slightly; this should work now. Sorry about the confusion. I guess it is now similar to your answer.
