I'm analyzing an app's output log with pandas and want to assign each entry to a session. A session is defined as a 60-minute period from the start.
Here's a small example:
import numpy as np
import pandas as pd
from datetime import timedelta
df = pd.DataFrame({
    'time': [
        pd.Timestamp(2019, 1, 1, 1, 10),
        pd.Timestamp(2019, 1, 1, 1, 15),
        pd.Timestamp(2019, 1, 1, 1, 20),
        pd.Timestamp(2019, 1, 1, 2, 20),
        pd.Timestamp(2019, 1, 1, 5, 0),
        pd.Timestamp(2019, 1, 1, 5, 15)
    ]
})
> df
                 time
0 2019-01-01 01:10:00
1 2019-01-01 01:15:00
2 2019-01-01 01:20:00
3 2019-01-01 02:20:00
4 2019-01-01 05:00:00
5 2019-01-01 05:15:00
For the first row, start_time is equal to time. For each subsequent row, if its time is within 1 hour of the previous row's time, it is considered part of the same session and inherits that row's start_time. If not, it starts a new session with start_time = time. I'm using a loop:
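df['start_time'] = pd.NaT  # datetime-typed placeholder instead of float NaN
for index in df.index:  # relies on the default 0..n-1 integer index
    if index == 0:
        start_time = df.loc[index, 'time']
    else:
        delta = df.loc[index, 'time'] - df.loc[index - 1, 'time']
        # Same session if within one hour of the previous row, otherwise a new one
        start_time = df.loc[index - 1, 'start_time'] if delta.total_seconds() <= 3600 else df.loc[index, 'time']
    # .loc avoids the chained assignment df['start_time'][index] = ...
    df.loc[index, 'start_time'] = start_time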
Output:
                 time          start_time
0 2019-01-01 01:10:00 2019-01-01 01:10:00
1 2019-01-01 01:15:00 2019-01-01 01:10:00
2 2019-01-01 01:20:00 2019-01-01 01:10:00
3 2019-01-01 02:20:00 2019-01-01 01:10:00
4 2019-01-01 05:00:00 2019-01-01 05:00:00 # new session
5 2019-01-01 05:15:00 2019-01-01 05:00:00
It works, but very slowly. Is there a way to vectorize it?
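For reference, here is a minimal sketch of how the rule described above might be expressed without a Python loop. It assumes the rows are already sorted by time and that "within 1 hour" means a gap of at most 3600 seconds to the previous row, as in the loop. The names gap and new_session are introduced here purely for illustration.

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    'time': pd.to_datetime([
        '2019-01-01 01:10', '2019-01-01 01:15', '2019-01-01 01:20',
        '2019-01-01 02:20', '2019-01-01 05:00', '2019-01-01 05:15',
    ])
})

# Gap between each row and the previous one (NaT for the first row)
gap = df['time'].diff()

# A row opens a new session if it has no predecessor
# or if the gap to the previous row exceeds one hour
new_session = gap.isna() | (gap > pd.Timedelta(hours=1))

# Keep the timestamp only on session-opening rows, then carry it forward
df['start_time'] = df['time'].where(new_session).ffill()

print(df)
```

The idea is that diff finds the gaps, where blanks out every row that does not open a session, and ffill propagates each session's first timestamp down to the rows that follow it. I have only checked this against the six sample rows, so treat it as a starting point rather than a verified replacement for the loop.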