
Using Python 3.6 and Pandas 0.19.2

I have a DataFrame such as this one:

   tid                datetime  event  data
0    0 2017-03-22 10:59:59.864  START   NaN
1    0 2017-03-22 10:59:59.931    END   NaN
2    0 2017-03-22 10:59:59.935  START   NaN
3    1 2017-03-22 10:59:59.939    END   NaN
4    0 2017-03-22 10:59:59.940    END   NaN
5    1 2017-03-22 10:59:59.941  START   NaN
6    1 2017-03-22 10:59:59.945    END   NaN
7    0 2017-03-22 10:59:59.947  START   NaN
8    1 2017-03-22 10:59:59.955  START   NaN

which contains start and end dates for transactions occurring inside threads (tid is the thread id). Sadly, the transactions themselves do not have a unique ID. So I need to group the rows by tid, order them by date, then take them two by two, so that each transaction has one START and one END.

My current problem is that my initial dataframe may be missing the first START event for a thread (in my example above, the line with index 3 is an END event with no preceding START). I need to remove those END lines, but I don't know how to do that.

Same problem for trailing START lines that have no matching END line.
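
To make the rule concrete, here is a minimal sketch (the helper name and toy data are mine) of what trimming a single thread should do: drop a leading END and a trailing START so the thread starts with START and ends with END.

```python
import pandas as pd

df = pd.DataFrame({
    'tid':   [1, 1, 1, 1],
    'event': ['END', 'START', 'END', 'START'],  # unmatched END first, START last
})

def trim(group):
    # drop a leading END with no preceding START
    if len(group) and group['event'].iloc[0] == 'END':
        group = group.iloc[1:]
    # drop a trailing START with no following END
    if len(group) and group['event'].iloc[-1] == 'START':
        group = group.iloc[:-1]
    return group

trimmed = df.groupby('tid', group_keys=False).apply(trim)
print(trimmed['event'].tolist())  # ['START', 'END']
```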

Sample Input

import pandas as pd
import io
df = pd.read_csv(io.StringIO('''tid;datetime;event
0;2017-03-22 10:59:59.864;START
0;2017-03-22 10:59:59.931;END
0;2017-03-22 10:59:59.935;START
1;2017-03-22 10:59:59.939;END
0;2017-03-22 10:59:59.940;END
1;2017-03-22 10:59:59.941;START
1;2017-03-22 10:59:59.945;END
0;2017-03-22 10:59:59.947;START
1;2017-03-22 10:59:59.955;START'''), sep=';', parse_dates=['datetime'])

Expected output

Same dataframe, but with line #3 dropped, because it is the first line for tid 1 and is not a START event:

   tid                datetime  event
0    0 2017-03-22 10:59:59.864  START
1    0 2017-03-22 10:59:59.931    END
3    1 2017-03-22 10:59:59.933  START
4    1 2017-03-22 10:59:59.945    END
5    0 2017-03-22 10:59:59.947  START
6    0 2017-03-22 10:59:59.955    END

BTW, bonus points if you end up with something like:

   tid          start_datetime           stop_datetime
0    0 2017-03-22 10:59:59.864 2017-03-22 10:59:59.931
1    1 2017-03-22 10:59:59.933 2017-03-22 10:59:59.945
2    0 2017-03-22 10:59:59.947 2017-03-22 10:59:59.955

What I have tried

df.sort(['tid', 'datetime']).groupby('tid').first().event == 'END' does not carry the original index from my dataframe, so I cannot use it to drop the lines (or, if it can, it is not obvious how to do that).
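
One thing I noticed while experimenting (not part of my original attempt): groupby().cumcount() is aligned to the original index, so a row-level mask for "first row of its tid, and an END" can be built directly. A sketch on a toy frame, assuming the data is already sorted by tid and datetime:

```python
import pandas as pd

df = pd.DataFrame({
    'tid':   [0, 0, 1, 1],
    'event': ['START', 'END', 'END', 'START'],  # tid 1 starts with an unmatched END
})
is_first = df.groupby('tid').cumcount() == 0   # first row of each tid, original index kept
mask = is_first & (df['event'] == 'END')       # first row that is an END
df_clean = df.loc[~mask]
print(df_clean.index.tolist())  # [0, 1, 3]
```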

4 Answers


One approach (the custom function could be tidied up to handle more diverse inputs, but it works for the sample input):

df = df.assign(group=(df.tid.diff().fillna(0) != 0).cumsum())

def myTwo(x):
    starttime = x.query('event == "START"')['datetime'].min()
    endtime = x.query('event == "END"')['datetime'].max()
    tid = x.tid.max()
    return pd.Series({'tid':tid,'start_datetime':starttime,'end_datetime':endtime})

print(df.groupby('group').apply(myTwo)[['tid','start_datetime','end_datetime']])

Output:

       tid              start_datetime                end_datetime
group                                                             
0        0  2017-03-22 10:59:59.864000  2017-03-22 10:59:59.931000
1        1  2017-03-22 10:59:59.933000  2017-03-22 10:59:59.945000
2        0  2017-03-22 10:59:59.947000  2017-03-22 10:59:59.955000

1 Comment

Thanks! I am still checking the docs to understand your answer ^^

You can use shift + cumsum to create a unique Series for grouping, then use a custom function that selects values with query and iat; last, reorder the columns with reindex_axis:

a = (df.tid != df.tid.shift()).cumsum()

def f(x):
    start = x.query('event == "START"')['datetime'].iat[0]
    end = x.query('event == "END"')['datetime'].iat[-1]
    tid = x.tid.iat[0]
    return pd.Series({'tid':tid,'start_datetime':start,'end_datetime':end})

print(df.groupby(a, as_index=False).apply(f)
        .reindex_axis(['tid','start_datetime','end_datetime'], 1))

   tid              start_datetime                end_datetime
0    0  2017-03-22 10:59:59.864000  2017-03-22 10:59:59.931000
1    1  2017-03-22 10:59:59.933000  2017-03-22 10:59:59.945000
2    0  2017-03-22 10:59:59.947000  2017-03-22 10:59:59.955000
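
A side note on what (df.tid != df.tid.shift()).cumsum() computes: the comparison is True whenever tid differs from the previous row, and the running sum turns each run of consecutive identical tids into one group number. A quick illustration on the raw (unsorted) tid column of the sample:

```python
import pandas as pd

tid = pd.Series([0, 0, 0, 1, 0, 1, 1, 0, 1])  # the raw tid column from the sample
group_id = (tid != tid.shift()).cumsum()
print(group_id.tolist())  # [1, 1, 1, 2, 3, 4, 4, 5, 6]
```

On cleaned data where consecutive rows form START/END pairs (tid 0, 0, 1, 1, 0, 0), the same expression yields three groups of two rows each, which is what the aggregation above relies on.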

Another solution with boolean indexing instead of query (maybe faster here; query tends to do better on larger DataFrames):

a = (df.tid != df.tid.shift()).cumsum()

def f(x):
    start = x.loc[x.event == "START", 'datetime'].iat[0]
    end = x.loc[x.event == "END", 'datetime'].iat[-1]
    tid = x.tid.iat[0]
    return pd.Series({'tid':tid,'start_datetime':start,'end_datetime':end})

print(df.groupby(a, as_index=False).apply(f)
        .reindex_axis(['tid','start_datetime','end_datetime'], 1))

   tid              start_datetime                end_datetime
0    0  2017-03-22 10:59:59.864000  2017-03-22 10:59:59.931000
1    1  2017-03-22 10:59:59.933000  2017-03-22 10:59:59.945000
2    0  2017-03-22 10:59:59.947000  2017-03-22 10:59:59.955000

8 Comments

:) I like your answer better. 'shift' cumsum.
@ScottBoston - Thank you.
Thanks! I will adapt and test ASAP with my real dataframe which is a bit more complex than the input I posted here. I am not sure what (df.tid != df.tid.shift()).cumsum() is supposed to do ? (looking at the docs right now)
I tested and I'm not sure that works as needed. What happens when my initial dataframe contains 4 or more consecutive lines for the same tid? (That happens when 2 consecutive events go to the same thread.) If I understand well, only the first and last lines are considered for start_date and end_date, so the lines in the middle will be ignored.
@Guillaume - Do you think all values of tid in sample are 0 ? Or something else?

Here's another approach, with a groupby() strategy based on this answer:

import numpy as np

# make a boolean mask flagging rows whose event repeats the previous row's event
def valid_event(x):
    if x.name:
        return df.loc[x.name - 1, 'event'] == x.event
    return False
mask = df.apply(valid_event, axis='columns')

# subset with mask, then group every 2 rows
df2 = df.loc[~mask]
df = (df2.groupby(np.arange(len(df2)) // 2) # groupby every 2 rows
         .agg({'tid':{'tid':'first'},
               'datetime':{'start_datetime':'min',
                           'stop_datetime':'max'}
              })
      )

df.columns = df.columns.droplevel() # drop multi-index cols

print(df)

   tid          start_datetime           stop_datetime
0    0 2017-03-22 10:59:59.864 2017-03-22 10:59:59.931
1    1 2017-03-22 10:59:59.933 2017-03-22 10:59:59.945
2    0 2017-03-22 10:59:59.947 2017-03-22 10:59:59.955
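
The np.arange(len(df2)) // 2 trick deserves a note: it builds a grouping key 0, 0, 1, 1, 2, 2, … so every consecutive pair of rows falls into one group. A minimal illustration on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'event': ['START', 'END', 'START', 'END'],
                   'datetime': [1, 2, 3, 4]})
key = np.arange(len(df)) // 2                  # [0, 0, 1, 1]: pairs rows 2 by 2
pairs = df.groupby(key)['datetime'].agg(['min', 'max'])
print(pairs)
```

The key is positional, so it pairs rows regardless of the frame's index labels; the data must already be ordered so that each pair really is one START followed by one END.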

1 Comment

Thank you for this, especially the .groupby(np.arange(len(df2))//2) part. Combined with other responses, I was able to solve my problem :) see my answer

I managed to partially solve my problem this way:

import numpy as np

# order events by thread id and datetime
df = df.sort_values(['tid', 'datetime']).reset_index(drop=True)
# then group by tid
for tid, group in df.groupby('tid'):
    # for each group, drop the first line if it is an END event
    head = group.head(1).iloc[0]
    if head.event == 'END':
        df.drop(head.name, inplace=True)
    # and drop the last line if it is a START event
    tail = group.tail(1).iloc[0]
    if tail.event == 'START':
        df.drop(tail.name, inplace=True)

# take lines 2 by 2: each pair is one START and one END event, which can be aggregated
df.groupby(np.arange(len(df)) // 2).agg({'tid': 'first', 'datetime': {'start': 'min', 'stop': 'max'}})
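
For reference, the same pipeline can be written without the explicit loop, using a cumcount-based trim (a sketch of mine, not the code above; the named-aggregation syntax needs a newer pandas than the 0.19.2 in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'tid': [0, 0, 1, 1, 1],
    'datetime': pd.to_datetime(['2017-03-22 10:59:59.864',
                                '2017-03-22 10:59:59.931',
                                '2017-03-22 10:59:59.939',
                                '2017-03-22 10:59:59.941',
                                '2017-03-22 10:59:59.945']),
    'event': ['START', 'END', 'END', 'START', 'END'],
})

df = df.sort_values(['tid', 'datetime']).reset_index(drop=True)

# drop a leading END and a trailing START inside each tid
first = df.groupby('tid').cumcount() == 0
last = df.groupby('tid').cumcount(ascending=False) == 0
df = df[~((first & (df.event == 'END')) | (last & (df.event == 'START')))]

# pair the remaining rows 2 by 2 and aggregate
out = df.groupby(np.arange(len(df)) // 2).agg(
    tid=('tid', 'first'),
    start_datetime=('datetime', 'min'),
    stop_datetime=('datetime', 'max'))
print(out)
```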
