I have a dataframe object like this:

 Date              ID           Delta
2019-10-16 16:43:46 BA9565P     0 days 00:00:00
2019-10-17 05:28:36 BA9565P     0 days 12:44:50
2019-10-16 16:43:13 BA9565X     0 days 00:00:00
2019-10-17 03:26:52 BA9565X     0 days 10:43:39
2019-10-10 19:17:17 BABRGNR     0 days 00:00:00
2019-10-12 19:43:56 BABRGNR     2 days 00:26:39
2019-10-31 00:48:52 BABRGR8     0 days 00:00:00
2019-11-01 14:33:41 BABRGR8     1 days 13:44:49

If two rows with the same ID are within 3 days of each other, I only need the latest one. However, if they are more than 3 days apart, I want to keep both records. So far I have done this:

import datetime
import pandas as pd

df2 = df[df.duplicated(['ID'], keep=False)][['Date', 'ID']].copy()
df2["Date"] = pd.to_datetime(df2["Date"])
df2["Delta"] = df2.groupby('ID')['Date'].diff()
df2["Delta"] = df2["Delta"].fillna(datetime.timedelta(seconds=0))

However I am not sure how should I continue. I have tried:

df2["Delta2"] = df2["Delta"] < datetime.timedelta(days=3)

But this condition is True for the first row of every group, whether or not the rows in that group are within 3 days of each other. I have also tried:

df2.groupby(['ID']).filter(lambda x: ((x["Delta"]<datetime.timedelta(days=3)) & \
                                             (x["Delta"] != datetime.timedelta(seconds=0))).any())

Again, this has a similar problem, because .diff() always returns NaT for the first element of each group. Is there a way to access the last element of the group? Or is there a better way than using groupby().diff()?

1 Answer

This solution keeps all rows of a group if any difference within that group is 3 days or more; for every other group it keeps only the last row:

print (df)
                 Date       ID            Delta
0 2019-10-16 16:43:46  BA9565P  0 days 00:00:00
1 2019-10-17 05:28:36  BA9565P  0 days 12:44:50
2 2019-10-16 16:43:13  BA9565X  0 days 00:00:00
3 2019-10-20 03:26:52  BA9565X  0 days 10:43:39 <- date changed to 2019-10-20 for this sample; Delta is recomputed below
4 2019-10-10 19:17:17  BABRGNR  0 days 00:00:00
5 2019-10-12 19:43:56  BABRGNR  2 days 00:26:39
6 2019-10-31 00:48:52  BABRGR8  0 days 00:00:00
7 2019-11-01 14:33:41  BABRGR8  1 days 13:44:49

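# if the dates are not already sorted:
# df = df.sort_values(['ID', 'Date'])
df2 = df[df.duplicated(['ID'], keep=False)].copy()
# time difference to the previous row of the same ID (0 for the first row)
df2["Delta"] = df2.groupby('ID')['Date'].diff().fillna(pd.Timedelta(0))
# True where the difference is under 3 days
mask = df2["Delta"] < pd.Timedelta(days=3)
# True for every row of a group in which all differences are under 3 days
mask1 = mask.groupby(df2['ID']).transform('all')
# True for the last row of each ID
mask2 = ~df2["ID"].duplicated(keep='last')

# keep whole groups that contain a gap of 3 days or more, else only the last row
df2 = df2[~mask1 | mask2]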
print (df2)
                 Date       ID           Delta
1 2019-10-17 05:28:36  BA9565P 0 days 12:44:50
2 2019-10-16 16:43:13  BA9565X 0 days 00:00:00
3 2019-10-20 03:26:52  BA9565X 3 days 10:43:39
5 2019-10-12 19:43:56  BABRGNR 2 days 00:26:39
7 2019-11-01 14:33:41  BABRGR8 1 days 13:44:49
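
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same masks wrapped in a function. The name `keep_latest_within` and its `days` parameter are made up for illustration, and the sample frame is rebuilt by hand from the data printed above. Unlike the snippet above, it does not drop IDs that occur only once; such rows are simply kept.

```python
import pandas as pd


def keep_latest_within(df, days=3):
    """Hypothetical helper: per ID, keep only the last row when every gap
    between consecutive rows is under `days`; otherwise keep all rows."""
    out = df.sort_values(["ID", "Date"]).copy()
    # Gap to the previous row of the same ID; the first row of each ID gets 0.
    out["Delta"] = out.groupby("ID")["Date"].diff().fillna(pd.Timedelta(0))
    # True for every row of an ID whose gaps are all under the threshold.
    under = out["Delta"] < pd.Timedelta(days=days)
    all_close = under.groupby(out["ID"]).transform("all")
    # True for the last row of each ID.
    is_last = ~out["ID"].duplicated(keep="last")
    return out[~all_close | is_last]


# Sample data copied from the answer (row 3 uses the changed date 2019-10-20).
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2019-10-16 16:43:46", "2019-10-17 05:28:36",
        "2019-10-16 16:43:13", "2019-10-20 03:26:52",
        "2019-10-10 19:17:17", "2019-10-12 19:43:56",
        "2019-10-31 00:48:52", "2019-11-01 14:33:41",
    ]),
    "ID": ["BA9565P", "BA9565P", "BA9565X", "BA9565X",
           "BABRGNR", "BABRGNR", "BABRGR8", "BABRGR8"],
})

result = keep_latest_within(df)
print(result)
```

With this sample it should select rows 1, 2, 3, 5 and 7, matching the output above. Filling the first `diff()` value with a zero timedelta is what lets `transform('all')` work: the first row never counts as a gap of 3 days or more, so only real gaps can switch a group to "keep everything".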