
Using Python 3.6 and Pandas 0.19.2

I have a DataFrame such as this one:

   tid                datetime  event  data
0    0 2017-03-22 10:59:59.864  START   NaN
1    0 2017-03-22 10:59:59.931    END   NaN
2    0 2017-03-22 10:59:59.935  START   NaN
3    1 2017-03-22 10:59:59.939    END   NaN
4    0 2017-03-22 10:59:59.940    END   NaN
5    1 2017-03-22 10:59:59.941  START   NaN
6    1 2017-03-22 10:59:59.945    END   NaN
7    0 2017-03-22 10:59:59.947  START   NaN
8    1 2017-03-22 10:59:59.955  START   NaN

which contains start and end dates for transactions occurring inside threads (tid is the thread id). Sadly, the transactions themselves do not have a unique ID. So I need to group the rows by tid, order them by date, then take them two by two, so that each transaction has one START and one END.

My current problem is that my initial dataframe may be missing the first START event for a thread (in my example above, the line with index 3 is an END event with no preceding START). I need to remove those END lines, but I don't know how to do that.

Same problem for trailing START lines that have no matching END line.
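
To make the rule concrete, here is a minimal sketch (the helper name and toy data are mine) of what trimming a single thread should do: drop a leading END and a trailing START so the thread starts with START and ends with END.

```python
import pandas as pd

df = pd.DataFrame({
    'tid':   [1, 1, 1, 1],
    'event': ['END', 'START', 'END', 'START'],  # unmatched END first, START last
})

def trim(group):
    # drop a leading END with no preceding START
    if len(group) and group['event'].iloc[0] == 'END':
        group = group.iloc[1:]
    # drop a trailing START with no following END
    if len(group) and group['event'].iloc[-1] == 'START':
        group = group.iloc[:-1]
    return group

trimmed = df.groupby('tid', group_keys=False).apply(trim)
print(trimmed['event'].tolist())  # ['START', 'END']
```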

Sample Input

import pandas as pd
import io
df = pd.read_csv(io.StringIO('''tid;datetime;event
0;2017-03-22 10:59:59.864;START
0;2017-03-22 10:59:59.931;END
0;2017-03-22 10:59:59.935;START
1;2017-03-22 10:59:59.939;END
0;2017-03-22 10:59:59.940;END
1;2017-03-22 10:59:59.941;START
1;2017-03-22 10:59:59.945;END
0;2017-03-22 10:59:59.947;START
1;2017-03-22 10:59:59.955;START'''), sep=';', parse_dates=['datetime'])

Expected output

Same dataframe, but with line #3 dropped, because it is the first line for tid 1 and is not a START event:

   tid                datetime  event
0    0 2017-03-22 10:59:59.864  START
1    0 2017-03-22 10:59:59.931    END
3    1 2017-03-22 10:59:59.933  START
4    1 2017-03-22 10:59:59.945    END
5    0 2017-03-22 10:59:59.947  START
6    0 2017-03-22 10:59:59.955    END

BTW, bonus points if you end up with something like:

   tid          start_datetime           stop_datetime
0    0 2017-03-22 10:59:59.864 2017-03-22 10:59:59.931
1    1 2017-03-22 10:59:59.933 2017-03-22 10:59:59.945
2    0 2017-03-22 10:59:59.947 2017-03-22 10:59:59.955

What I have tried

df.sort(['tid', 'datetime']).groupby('tid').first().event == 'END' does not carry the original index from my dataframe, so I cannot use it to drop the lines (or, if it can, it is not obvious how to do that).
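
One thing I noticed while experimenting (not part of my original attempt): groupby().cumcount() is aligned to the original index, so a row-level mask for "first row of its tid, and an END" can be built directly. A sketch on a toy frame, assuming the data is already sorted by tid and datetime:

```python
import pandas as pd

df = pd.DataFrame({
    'tid':   [0, 0, 1, 1],
    'event': ['START', 'END', 'END', 'START'],  # tid 1 starts with an unmatched END
})
is_first = df.groupby('tid').cumcount() == 0   # first row of each tid, original index kept
mask = is_first & (df['event'] == 'END')       # first row that is an END
df_clean = df.loc[~mask]
print(df_clean.index.tolist())  # [0, 1, 3]
```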

4 Answers


One approach (the custom function could be tidied up to handle more diverse inputs, but it works for the sample input):

df = df.assign(group=(df.tid.diff().fillna(0) != 0).cumsum())

def myTwo(x):
    starttime = x.query('event == "START"')['datetime'].min()
    endtime = x.query('event == "END"')['datetime'].max()
    tid = x.tid.max()
    return pd.Series({'tid':tid,'start_datetime':starttime,'end_datetime':endtime})

print(df.groupby('group').apply(myTwo)[['tid','start_datetime','end_datetime']])

Output:

       tid              start_datetime                end_datetime
group                                                             
0        0  2017-03-22 10:59:59.864000  2017-03-22 10:59:59.931000
1        1  2017-03-22 10:59:59.933000  2017-03-22 10:59:59.945000
2        0  2017-03-22 10:59:59.947000  2017-03-22 10:59:59.955000

1 Comment

Thanks! I am still checking the docs to understand your answer ^^

You can use shift + cumsum to create a unique Series for grouping, then use a custom function that selects values with query and iat; last, reorder the columns with reindex_axis:

a = (df.tid != df.tid.shift()).cumsum()

def f(x):
    start = x.query('event == "START"')['datetime'].iat[0]
    end = x.query('event == "END"')['datetime'].iat[-1]
    tid = x.tid.iat[0]
    return pd.Series({'tid':tid,'start_datetime':start,'end_datetime':end})

print(df.groupby(a, as_index=False).apply(f)
        .reindex_axis(['tid','start_datetime','end_datetime'], 1))

   tid              start_datetime                end_datetime
0    0  2017-03-22 10:59:59.864000  2017-03-22 10:59:59.931000
1    1  2017-03-22 10:59:59.933000  2017-03-22 10:59:59.945000
2    0  2017-03-22 10:59:59.947000  2017-03-22 10:59:59.955000
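
A side note on what (df.tid != df.tid.shift()).cumsum() computes: the comparison is True whenever tid differs from the previous row, and the running sum turns each run of consecutive identical tids into one group number. A quick illustration on the raw (unsorted) tid column of the sample:

```python
import pandas as pd

tid = pd.Series([0, 0, 0, 1, 0, 1, 1, 0, 1])  # the raw tid column from the sample
group_id = (tid != tid.shift()).cumsum()
print(group_id.tolist())  # [1, 1, 1, 2, 3, 4, 4, 5, 6]
```

On cleaned data where consecutive rows form START/END pairs (tid 0, 0, 1, 1, 0, 0), the same expression yields three groups of two rows each, which is what the aggregation above relies on.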

Another solution with boolean indexing instead of query (maybe faster here; query tends to do better on larger DataFrames):

a = (df.tid != df.tid.shift()).cumsum()

def f(x):
    start = x.loc[x.event == "START", 'datetime'].iat[0]
    end = x.loc[x.event == "END", 'datetime'].iat[-1]
    tid = x.tid.iat[0]
    return pd.Series({'tid':tid,'start_datetime':start,'end_datetime':end})

print(df.groupby(a, as_index=False).apply(f)
        .reindex_axis(['tid','start_datetime','end_datetime'], 1))

   tid              start_datetime                end_datetime
0    0  2017-03-22 10:59:59.864000  2017-03-22 10:59:59.931000
1    1  2017-03-22 10:59:59.933000  2017-03-22 10:59:59.945000
2    0  2017-03-22 10:59:59.947000  2017-03-22 10:59:59.955000

8 Comments

:) I like your answer better. 'shift' cumsum.
@ScottBoston - Thank you.
Thanks! I will adapt and test ASAP with my real dataframe which is a bit more complex than the input I posted here. I am not sure what (df.tid != df.tid.shift()).cumsum() is supposed to do ? (looking at the docs right now)
I tested and I'm not sure that works as needed. What happens when my initial dataframe contains 4 or more consecutive lines for the same tid? (That happens when 2 consecutive events go to the same thread.) If I understand well, only the first and last lines are considered for start_date and end_date, so the lines in the middle will be ignored.
@Guillaume - Do you think all values of tid in sample are 0 ? Or something else?

Here's another approach, with a groupby() strategy based on this answer:

import numpy as np

# make a boolean mask flagging rows whose event repeats the previous row's event
def valid_event(x):
    if x.name:
        return df.loc[x.name - 1, 'event'] == x.event
    return False
mask = df.apply(valid_event, axis='columns')

# subset with mask, then group every 2 rows
df2 = df.loc[~mask]
df = (df2.groupby(np.arange(len(df2)) // 2) # groupby every 2 rows
         .agg({'tid':{'tid':'first'},
               'datetime':{'start_datetime':'min',
                           'stop_datetime':'max'}
              })
      )

df.columns = df.columns.droplevel() # drop multi-index cols

print(df)

   tid          start_datetime           stop_datetime
0    0 2017-03-22 10:59:59.864 2017-03-22 10:59:59.931
1    1 2017-03-22 10:59:59.933 2017-03-22 10:59:59.945
2    0 2017-03-22 10:59:59.947 2017-03-22 10:59:59.955
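
The np.arange(len(df2)) // 2 trick deserves a note: it builds a grouping key 0, 0, 1, 1, 2, 2, … so every consecutive pair of rows falls into one group. A minimal illustration on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'event': ['START', 'END', 'START', 'END'],
                   'datetime': [1, 2, 3, 4]})
key = np.arange(len(df)) // 2                  # [0, 0, 1, 1]: pairs rows 2 by 2
pairs = df.groupby(key)['datetime'].agg(['min', 'max'])
print(pairs)
```

The key is positional, so it pairs rows regardless of the frame's index labels; the data must already be ordered so that each pair really is one START followed by one END.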

1 Comment

Thank you for this, especially the .groupby(np.arange(len(df2))//2) part. Combined with other responses, I was able to solve my problem :) see my answer

I managed to partially solve my problem this way:

import numpy as np

# order events by thread id and datetime
df = df.sort_values(['tid', 'datetime']).reset_index(drop=True)
# then group by tid
for tid, group in df.groupby('tid'):
    # for each group, drop the first line if it is an END event
    head = group.head(1).iloc[0]
    if head.event == 'END':
        df.drop(head.name, inplace=True)
    # and drop the last line if it is a START event
    tail = group.tail(1).iloc[0]
    if tail.event == 'START':
        df.drop(tail.name, inplace=True)

# take lines 2 by 2: each pair is one START and one END event, which can be aggregated
df.groupby(np.arange(len(df)) // 2).agg({'tid': 'first', 'datetime': {'start': 'min', 'stop': 'max'}})
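
For reference, the same pipeline can be written without the explicit loop, using a cumcount-based trim (a sketch of mine, not the code above; the named-aggregation syntax needs a newer pandas than the 0.19.2 in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'tid': [0, 0, 1, 1, 1],
    'datetime': pd.to_datetime(['2017-03-22 10:59:59.864',
                                '2017-03-22 10:59:59.931',
                                '2017-03-22 10:59:59.939',
                                '2017-03-22 10:59:59.941',
                                '2017-03-22 10:59:59.945']),
    'event': ['START', 'END', 'END', 'START', 'END'],
})

df = df.sort_values(['tid', 'datetime']).reset_index(drop=True)

# drop a leading END and a trailing START inside each tid
first = df.groupby('tid').cumcount() == 0
last = df.groupby('tid').cumcount(ascending=False) == 0
df = df[~((first & (df.event == 'END')) | (last & (df.event == 'START')))]

# pair the remaining rows 2 by 2 and aggregate
out = df.groupby(np.arange(len(df)) // 2).agg(
    tid=('tid', 'first'),
    start_datetime=('datetime', 'min'),
    stop_datetime=('datetime', 'max'))
print(out)
```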
