
I want to keep the last few rows of a dataframe, but once there is a time gap above 100ms, cut off everything above that gap. For example:

Input:

           Time  X
0   12:30:00.00  A
1  12:30:00.100  B
2  12:30:00.202  C
3  12:30:00.300  D

Output:

           Time  X
2  12:30:00.202  C
3  12:30:00.300  D

Explanation: there's more than 100ms between rows B and C, so we throw away everything above row C.
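For reference, the example input can be reproduced with something like this (a sketch; it assumes Time is stored as plain strings):

import pandas as pd

df = pd.DataFrame({'Time': ['12:30:00.00', '12:30:00.100',
                            '12:30:00.202', '12:30:00.300'],
                   'X': ['A', 'B', 'C', 'D']})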

2 Comments
  • What is your expected behavior when there are multiple 100ms+ gaps in the data? Take the last group past the gaps? Commented May 31, 2016 at 14:01
  • No, truncate at the first time there is a 100ms gap, and by first time I mean when looking from the end towards the start (top). Commented May 31, 2016 at 14:31

1 Answer


You can use diff and compare the result with a Timedelta created by to_timedelta, then take the cumsum and check whether it is >= 1. Finally, use boolean indexing:

import pandas as pd

df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S.%f')

print (df)
                     Time  X
0 1900-01-01 12:30:00.000  A
1 1900-01-01 12:30:00.100  B
2 1900-01-01 12:30:00.202  C
3 1900-01-01 12:30:00.300  D

print (df.Time.diff())
0               NaT
1   00:00:00.100000
2   00:00:00.102000
3   00:00:00.098000
Name: Time, dtype: timedelta64[ns]

mask = (((df.Time.diff() > pd.to_timedelta('00:00:00.100000')).cumsum()) >= 1)
print (mask)
0    False
1    False
2     True
3     True
Name: Time, dtype: bool

print (df[mask])
                     Time  X
2 1900-01-01 12:30:00.202  C
3 1900-01-01 12:30:00.300  D
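As a side note (not part of the original answer), the 100ms threshold can also be written with an explicit Timedelta, which is equivalent to the to_timedelta string above and reads a bit more directly:

#same mask, written with a keyword Timedelta
mask = (df.Time.diff() > pd.Timedelta(milliseconds=100)).cumsum() >= 1
print (df[mask])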

If you need the column Time to stay unchanged and want to split on the first gap greater than 100ms:

df['Time1']= pd.to_datetime(df['Time'], format='%H:%M:%S.%f')
print (df)
           Time  X                   Time1
0   12:30:00.00  A 1900-01-01 12:30:00.000
1  12:30:00.100  B 1900-01-01 12:30:00.100
2  12:30:00.202  C 1900-01-01 12:30:00.202
3  12:30:00.300  D 1900-01-01 12:30:00.300
1  12:30:00.100  E 1900-01-01 12:30:00.100
2  12:30:00.202  F 1900-01-01 12:30:00.202

print (df.Time1.diff())
0                        NaT
1            00:00:00.100000
2            00:00:00.102000
3            00:00:00.098000
1   -1 days +23:59:59.800000
2            00:00:00.102000
Name: Time1, dtype: timedelta64[ns]

mask = (((df.Time1.diff() > pd.to_timedelta('00:00:00.100000')).cumsum()) >= 1)
print (mask)
0    False
1    False
2     True
3     True
1     True
2     True
Name: Time1, dtype: bool

print (df[mask].drop('Time1',axis=1))
           Time  X
2  12:30:00.202  C
3  12:30:00.300  D
1  12:30:00.100  E
2  12:30:00.202  F
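If this has to be done repeatedly, the logic can be wrapped in a small helper function (a sketch with assumed names, not part of the original answer):

def truncate_at_first_gap(frame, col='Time', threshold='100ms', fmt='%H:%M:%S.%f'):
    #drop all rows before the first gap larger than threshold
    #mirrors the masks above: if no gap exceeds threshold, an empty frame is returned
    times = pd.to_datetime(frame[col], format=fmt)
    mask = (times.diff() > pd.Timedelta(threshold)).cumsum() >= 1
    return frame[mask]

#usage: truncate_at_first_gap(df)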

If you need to split by the last gap (keep only the last group):

print (df)
           Time  X
0   12:30:00.00  A
1  12:30:00.100  B
2  12:30:00.202  C
3  12:30:00.300  D
1  12:30:00.100  E
2  12:30:00.202  F

#create helper series
time_ser= pd.to_datetime(df['Time'], format='%H:%M:%S.%f')
#get differences
print (time_ser.diff())
0                        NaT
1            00:00:00.100000
2            00:00:00.102000
3            00:00:00.098000
1   -1 days +23:59:59.800000
2            00:00:00.102000
Name: Time, dtype: timedelta64[ns]
#compare with 100ms timedelta and count gaps cumulatively
mask = (((time_ser.diff() > pd.to_timedelta('00:00:00.100000')).cumsum()))
print (mask)
0    0
1    0
2    1
3    1
1    1
2    2
Name: Time, dtype: int32

#get last value of mask
last_val = mask.iat[-1]
print(last_val)
2

#compare mask with last value and use boolean indexing
print (df[mask == last_val])
           Time  X
2  12:30:00.202  F
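An alternative sketch (my own variant, not from the original answer) locates the position of the last 100ms+ gap and slices the frame positionally instead of comparing against the last mask value:

gaps = time_ser.diff() > pd.Timedelta(milliseconds=100)
if gaps.any():
    #position of the last gap, counted from the start
    last_gap_pos = len(gaps) - 1 - gaps.values[::-1].argmax()
else:
    last_gap_pos = 0
print (df.iloc[last_gap_pos:])
           Time  X
2  12:30:00.202  F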

1 Comment

I edited the answer to add splitting by the last value, please check the solution. Thanks.
