I have a DataFrame of transactions. I want to group them by their item and time column values: for each item, transactions that are within 1 hour of each other should fall into the same group. A new group starts at the first observation that is not within an hour of the previous observation (see the start time column in DataFrame B).
Here is the data: I want to convert A to B.
A=
item time result
A 2016-04-18 13:08:25 Y
A 2016-04-18 13:57:05 N
A 2016-04-18 14:00:12 N
A 2016-04-18 23:45:50 Y
A 2016-04-20 16:53:48 Y
A 2016-04-20 17:11:47 N
B 2016-04-18 15:24:48 N
C 2016-04-23 13:20:44 N
C 2016-04-23 14:02:23 Y
B=
item start time end time Ys Ns total count
A 2016-04-18 13:08:25 2016-04-18 14:08:25 1 2 3
A 2016-04-18 23:45:50 2016-04-19 00:45:50 1 0 1
A 2016-04-20 16:53:48 2016-04-20 17:53:48 1 1 2
B 2016-04-18 15:24:48 2016-04-18 16:24:48 0 1 1
C 2016-04-23 13:20:44 2016-04-23 14:20:44 1 1 2
Here is what I did:
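import pandas as pd
# End of a 1-hour window that starts at each item's first transaction
grouped = A.groupby('item')
A['end'] = grouped['time'].transform(lambda grp: grp.min() + pd.Timedelta(hours=1))
# Keep only the transactions that fall inside that first window
A2 = A.loc[A['time'] <= A['end']]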
This gives me only one group per item: the transactions within 1 hour of that item's first transaction. So I'm missing the other transactions for the same item that are more than 1 hour after the first. My struggle is how to get those groups. Once I have them, I can use pd.crosstab to get the counts I want from the result column.
Another idea I have is to sort A by item and time, and then go row by row: if a row's time is within 1 hour of the previous row for the same item, the row joins the current group; otherwise, it starts a new group.
grouped in your code? How did you get it?