Using Python 3.6 and Pandas 0.19.2, I have a DataFrame such as this one:
tid datetime event data
0 0 2017-03-22 10:59:59.864 START NaN
1 0 2017-03-22 10:59:59.931 END NaN
2 0 2017-03-22 10:59:59.935 START NaN
3 1 2017-03-22 10:59:59.939 END NaN
4 0 2017-03-22 10:59:59.940 END NaN
5 1 2017-03-22 10:59:59.941 START NaN
6 1 2017-03-22 10:59:59.945 END NaN
7 0 2017-03-22 10:59:59.947 START NaN
8 1 2017-03-22 10:59:59.955 START NaN
It contains start and end timestamps for transactions occurring inside threads (tid is the thread id). Sadly, the transactions themselves do not have a unique ID, so I need to group the rows by tid, order them by datetime, then take the rows two by two in order to get one START and one END for each transaction.
My current problem is that my initial DataFrame may be missing the first START event for a thread (in the example above, the row with index 3 is an END event with no previous START). I need to remove those END rows, but I don't know how to do that.
The same problem applies to the last START rows that have no matching END row.
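To make the rule above concrete, here is a minimal sketch of one way it could be expressed. It assumes that, within a thread, events strictly alternate apart from a possible leading END and a possible trailing START; the names `ordered`, `prev_event`, `next_event` and `cleaned` are only illustrative.

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO('''tid;datetime;event
0;2017-03-22 10:59:59.864;START
0;2017-03-22 10:59:59.931;END
0;2017-03-22 10:59:59.935;START
1;2017-03-22 10:59:59.939;END
0;2017-03-22 10:59:59.940;END
1;2017-03-22 10:59:59.941;START
1;2017-03-22 10:59:59.945;END
0;2017-03-22 10:59:59.947;START
1;2017-03-22 10:59:59.955;START'''), sep=';', parse_dates=['datetime'])

# Order the events chronologically inside each thread.
ordered = df.sort_values(['tid', 'datetime'])

# Previous and next event within the same thread (NaN at the group edges).
by_tid = ordered.groupby('tid')['event']
prev_event = by_tid.shift(1)
next_event = by_tid.shift(-1)

# A START is kept only if the next event of its thread is an END,
# an END only if the previous event of its thread is a START.
is_paired_start = (ordered['event'] == 'START') & (next_event == 'END')
is_paired_end = (ordered['event'] == 'END') & (prev_event == 'START')

# Restore the original row order; the original index labels are preserved.
cleaned = ordered[is_paired_start | is_paired_end].sort_index()
print(cleaned)
```

Because the comparison against a missing neighbour evaluates to False, the leading END of thread 1 and the trailing START of each thread fall out without any special casing.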
Sample input
import pandas as pd
import io
df = pd.read_csv(io.StringIO('''tid;datetime;event
0;2017-03-22 10:59:59.864;START
0;2017-03-22 10:59:59.931;END
0;2017-03-22 10:59:59.935;START
1;2017-03-22 10:59:59.939;END
0;2017-03-22 10:59:59.940;END
1;2017-03-22 10:59:59.941;START
1;2017-03-22 10:59:59.945;END
0;2017-03-22 10:59:59.947;START
1;2017-03-22 10:59:59.955;START'''), sep=';', parse_dates=['datetime'])
Expected output
The same DataFrame with the row at index 3 dropped, because it is the first row for tid 1 and is not a START event, and with the rows at index 7 and 8 dropped, because they are trailing START events with no matching END:
tid datetime event
0 0 2017-03-22 10:59:59.864 START
1 0 2017-03-22 10:59:59.931 END
2 0 2017-03-22 10:59:59.935 START
4 0 2017-03-22 10:59:59.940 END
5 1 2017-03-22 10:59:59.941 START
6 1 2017-03-22 10:59:59.945 END
BTW, bonus points if you end up with something like:
tid start_datetime stop_datetime
0 0 2017-03-22 10:59:59.864 2017-03-22 10:59:59.931
1 0 2017-03-22 10:59:59.935 2017-03-22 10:59:59.940
2 1 2017-03-22 10:59:59.941 2017-03-22 10:59:59.945
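For the bonus layout, a rough sketch of the pairing step could look like the following. It assumes the unmatched rows have already been removed, so that within each thread every START is immediately followed by its END; the frame `cleaned` below is the sample input with the unmatched rows taken out by hand, and `starts`, `ends` and `pairs` are illustrative names.

```python
import io
import pandas as pd

# Sample rows with the unmatched END/START rows already removed (assumption).
cleaned = pd.read_csv(io.StringIO('''tid;datetime;event
0;2017-03-22 10:59:59.864;START
0;2017-03-22 10:59:59.931;END
0;2017-03-22 10:59:59.935;START
0;2017-03-22 10:59:59.940;END
1;2017-03-22 10:59:59.941;START
1;2017-03-22 10:59:59.945;END'''), sep=';', parse_dates=['datetime'])

# Sort so that the k-th START and the k-th END belong to the same transaction.
cleaned = cleaned.sort_values(['tid', 'datetime'])
starts = cleaned[cleaned['event'] == 'START'].reset_index(drop=True)
ends = cleaned[cleaned['event'] == 'END'].reset_index(drop=True)

# Put the two halves of each transaction side by side.
pairs = pd.DataFrame({
    'tid': starts['tid'],
    'start_datetime': starts['datetime'],
    'stop_datetime': ends['datetime'],
})

# Present the transactions in chronological order of their start.
pairs = pairs.sort_values('start_datetime').reset_index(drop=True)
print(pairs)
```

This positional pairing only holds under the strict-alternation assumption; if a thread could contain two consecutive START events, the rows would have to be matched more carefully.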
What I have tried
df.sort_values(['tid', 'datetime']).groupby('tid').first().event == 'END' tells me which threads begin with an END event, but the result is indexed by tid and no longer carries the original index of my DataFrame, so I cannot use it to drop the rows (or, if I can, it is not obvious how to do that).
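One way around the lost index, sketched here as a possibility rather than a definitive answer, is to take the first row of each group with `head(1)` instead of `first()`: `head` returns the original rows with their original index labels. The names `first_rows` and `leading_end_index` are illustrative, and this only handles the leading END rows, not the trailing START rows.

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO('''tid;datetime;event
0;2017-03-22 10:59:59.864;START
0;2017-03-22 10:59:59.931;END
0;2017-03-22 10:59:59.935;START
1;2017-03-22 10:59:59.939;END
0;2017-03-22 10:59:59.940;END
1;2017-03-22 10:59:59.941;START
1;2017-03-22 10:59:59.945;END
0;2017-03-22 10:59:59.947;START
1;2017-03-22 10:59:59.955;START'''), sep=';', parse_dates=['datetime'])

ordered = df.sort_values(['tid', 'datetime'])

# head(1) keeps the first row of each thread together with its original index.
first_rows = ordered.groupby('tid').head(1)

# Original index labels of threads whose first event is an END.
leading_end_index = first_rows[first_rows['event'] == 'END'].index

# Drop those rows from the original DataFrame.
without_leading_end = df.drop(leading_end_index)
print(without_leading_end)
```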