I want to calculate the travel time of each passenger in my data frame, defined as the difference between the moment they first get on the bus and the moment they leave it.
Here is the data frame:
import pandas as pd

my_df = pd.DataFrame({
    'id': ['a', 'b', 'b', 'b', 'b', 'b', 'c', 'd'],
    'date': ['2020/02/03', '2020/04/05', '2020/04/05', '2020/04/05', '2020/04/06', '2020/04/06', '2020/12/15', '2020/06/23'],
    'arriving_time': ['14:36:06', '08:52:02', '08:53:02', '08:55:24', '18:58:03', '19:03:05', '17:04:28', '21:31:23'],
    'leaving_time': ['14:40:05', '08:52:41', '08:54:33', '08:57:14', '19:01:07', '19:04:08', '17:09:48', '21:50:12']
})
print(my_df)
output:
  id        date arriving_time leaving_time
0  a  2020/02/03      14:36:06     14:40:05
1  b  2020/04/05      08:52:02     08:52:41
2  b  2020/04/05      08:53:02     08:54:33
3  b  2020/04/05      08:55:24     08:57:14
4  b  2020/04/06      18:58:03     19:01:07
5  b  2020/04/06      19:03:05     19:04:08
6  c  2020/12/15      17:04:28     17:09:48
7  d  2020/06/23      21:31:23     21:50:12
However, there are two problems (that I can't manage to solve myself):
- Passengers are detected via their phone signal, but the signal is often unstable. This is why the same person can have many rows (like passenger b in the data set above). "arriving_time" is the time at which the signal is detected and "leaving_time" is the time at which the signal is lost.
- To compute the travel time, I need to subtract, for each unique ID and for each trip, the earliest arriving_time from the latest leaving_time.
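To make that rule concrete, here is a minimal sketch for a single trip. It only uses passenger b's three detections on 2020/04/05 from the data above; the names `arriving`, `leaving` and `travel_time` are just illustrative.

```python
import pandas as pd

# Passenger b's three detections on 2020/04/05, copied from the data above.
arriving = pd.to_datetime(['08:52:02', '08:53:02', '08:55:24'], format='%H:%M:%S')
leaving = pd.to_datetime(['08:52:41', '08:54:33', '08:57:14'], format='%H:%M:%S')

# Travel time = latest leaving_time minus earliest arriving_time.
travel_time = leaving.max() - arriving.min()
print(travel_time)  # 0 days 00:05:12
```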
Here is the result I want to obtain:
  id        date arriving_time leaving_time travelTime
0  a  2020/02/03      14:36:06     14:40:05   00:03:59
1  b  2020/04/05      08:52:02     08:52:41   00:05:12
2  b  2020/04/05      08:53:02     08:54:33   00:05:12
3  b  2020/04/05      08:55:24     08:57:14   00:05:12
4  b  2020/04/06      18:58:03     19:01:07   00:06:05
5  b  2020/04/06      19:03:05     19:04:08   00:06:05
6  c  2020/12/15      17:04:28     17:09:48   00:05:20
7  d  2020/06/23      21:31:23     21:50:12   00:18:49
As you can see, passenger b made two separate trips (one on 2020/04/05 and one on 2020/04/06), and I want to compute how long each of them lasted.
I already tried the following code, which seems to work, but it is really slow (which I think is due to the large number of rows in my_df):
for user_id in set(my_df.id):
    for day in set(my_df.loc[my_df.id == user_id, 'date']):
        mask = (my_df.id == user_id) & (my_df.date == day)
        leaving = my_df.loc[mask, 'leaving_time'].apply(pd.to_datetime)
        arriving = my_df.loc[mask, 'arriving_time'].apply(pd.to_datetime)
        my_df.loc[mask, 'travelTime'] = max(leaving) - min(arriving)
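For comparison, here is a sketch of one possible vectorized version of the same idea, using `groupby(...).transform` instead of the nested loops. It assumes, as the loop does, that one trip corresponds to one (id, date) pair, so two trips by the same passenger on the same day would be merged and a trip crossing midnight would be split. The helper names `arrive`, `leave`, `keys`, `first_seen` and `last_seen` are illustrative only.

```python
import pandas as pd

my_df = pd.DataFrame({
    'id': ['a', 'b', 'b', 'b', 'b', 'b', 'c', 'd'],
    'date': ['2020/02/03', '2020/04/05', '2020/04/05', '2020/04/05',
             '2020/04/06', '2020/04/06', '2020/12/15', '2020/06/23'],
    'arriving_time': ['14:36:06', '08:52:02', '08:53:02', '08:55:24',
                      '18:58:03', '19:03:05', '17:04:28', '21:31:23'],
    'leaving_time': ['14:40:05', '08:52:41', '08:54:33', '08:57:14',
                     '19:01:07', '19:04:08', '17:09:48', '21:50:12'],
})

# Parse each column once for the whole frame instead of once per group.
fmt = '%Y/%m/%d %H:%M:%S'
arrive = pd.to_datetime(my_df['date'] + ' ' + my_df['arriving_time'], format=fmt)
leave = pd.to_datetime(my_df['date'] + ' ' + my_df['leaving_time'], format=fmt)

# Assumption: one trip == one (id, date) pair, as in the loop version.
keys = [my_df['id'], my_df['date']]
first_seen = arrive.groupby(keys).transform('min')  # earliest arrival, broadcast to every row of the group
last_seen = leave.groupby(keys).transform('max')    # latest departure, broadcast to every row of the group

my_df['travelTime'] = last_seen - first_seen
print(my_df)
```

The resulting `travelTime` column holds `Timedelta` values, so pandas prints them as `0 days 00:05:12` rather than `00:05:12`; the underlying durations are the same.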