Drop consecutive duplicate rows based on condition

Question

I currently have this dataframe:

id  date       outcome
3   03/05/2019  no
3   29/05/2019  no
3   04/09/2019  no
3   30/10/2019  yes
3   03/05/2020  no
5   03/12/2019  no
5   26/12/2019  no
5   27/01/2020  yes
5   03/06/2020  yes
6   04/05/2019  no
6   27/10/2019  no
6   26/11/2019  yes
6   28/11/2019  yes
6   29/11/2019  yes
6   13/04/2020  yes
6   14/04/2020  yes
6   24/04/2020  no
6   30/04/2020  no
6   05/05/2020  no

It is grouped based on id and is in ascending order for date.
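For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is only a sketch: it assumes a default integer index and keeps the dates as day-first strings, parsing them only to check the ordering claim.

```python
import pandas as pd

# Sample data copied from the table above; dates are kept as day-first strings.
df = pd.DataFrame({
    "id": [3] * 5 + [5] * 4 + [6] * 10,
    "date": [
        "03/05/2019", "29/05/2019", "04/09/2019", "30/10/2019", "03/05/2020",
        "03/12/2019", "26/12/2019", "27/01/2020", "03/06/2020",
        "04/05/2019", "27/10/2019", "26/11/2019", "28/11/2019", "29/11/2019",
        "13/04/2020", "14/04/2020", "24/04/2020", "30/04/2020", "05/05/2020",
    ],
    "outcome": [
        "no", "no", "no", "yes", "no",
        "no", "no", "yes", "yes",
        "no", "no", "yes", "yes", "yes", "yes", "yes", "no", "no", "no",
    ],
})

# Sanity check: within each id, the parsed dates are in ascending order.
parsed = pd.to_datetime(df["date"], format="%d/%m/%Y")
dates_sorted = parsed.groupby(df["id"]).apply(lambda s: s.is_monotonic_increasing).all()
print(dates_sorted)
```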

I want to remove a row if the row after it (within the same id) has the same outcome, so that only the last row of each run is kept. However, after a yes, the next row kept must be the first no of the following run rather than the last one. This is the desired outcome for the above dataframe:

id  date       outcome
3   04/09/2019  no
3   30/10/2019  yes
3   03/05/2020  no
5   26/12/2019  no
5   03/06/2020  yes
6   27/10/2019  no
6   14/04/2020  yes
6   24/04/2020  no

At the moment I am doing this:

m1 = (df['outcome'] != df['outcome'].shift()).cumsum()
updated_df = df.groupby([df['id'],m1]).tail(1)

However, this only gives me the last row of each run of consecutive yes/no values. How can I apply the extra condition in the most idiomatic pandas way?
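To make the gap concrete, here is a small self-contained sketch comparing the rows kept by the snippet above with the desired rows, by index label. It rebuilds only the id and outcome columns (assuming the dates play no role in the logic) and assumes a default integer index.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [3] * 5 + [5] * 4 + [6] * 10,
    "outcome": ("no no no yes no "
                "no no yes yes "
                "no no yes yes yes yes yes no no no").split(),
})

# Current attempt: label each run of equal outcomes, then keep its last row per id.
m1 = (df["outcome"] != df["outcome"].shift()).cumsum()
updated_df = df.groupby([df["id"], m1]).tail(1)

current = updated_df.index.tolist()
desired = [2, 3, 4, 6, 8, 10, 15, 16]  # index labels of the desired rows

missing = sorted(set(desired) - set(current))  # wanted but not returned
extra = sorted(set(current) - set(desired))    # returned but not wanted
print(current, missing, extra)
```

On this data the two selections differ only in the last run of id 6: the current approach keeps row 18 (the last no), whereas the desired output keeps row 16 (the first no after the yes).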

I don't understand the condition. For id=3 the row after the 'yes' (i.e. the row with date 03/05/2020) is not the first row of the group with outcome 'no', but it's still present in the expected output.
– Rodalm, Commented Nov 6, 2021 at 21:29
@HarryPlotter it's a bit tricky, I got it wrong at first. Basically it's dropping the consecutive duplicates, keeping the last, except after a yes, keeping the first. Everything, per group.
– mozway, Commented Nov 6, 2021 at 21:41

mozway · Accepted Answer · 2021-11-06 21:44:29Z


IIUC, you need two steps. First, compute a mask that keeps a row if its outcome differs from the next one (i.e. it is the last of its run) OR if it is a no that directly follows a yes, everything being done per group. This leads to the filtering you want, except after a yes, where you may be left with two consecutive no rows: the "after-yes" one to keep and the "last" one to discard.

Second, check again for consecutive duplicate outcomes, but keep the first of each run this time.

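# step 1: keep the last row of each run, plus any "no" that directly follows a "yes"
m1 = df['outcome']
m2 = m1.groupby(df['id']).shift(-1)  # next outcome within the same id
m3 = m1.groupby(df['id']).shift().eq('yes') & m1.eq('no')  # first "no" after a "yes"

df2 = df[~m1.eq(m2) | m3]

# step 2: drop consecutive duplicates again, this time keeping the first of each run
m4 = df2['outcome']
m5 = m4.groupby(df2['id']).shift()  # previous outcome within the same id
df2[~m4.eq(m5)]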

Output:

    id        date outcome
2    3  04/09/2019      no
3    3  30/10/2019     yes
4    3  03/05/2020      no
6    5  26/12/2019      no
8    5  03/06/2020     yes
10   6  27/10/2019      no
15   6  14/04/2020     yes
16   6  24/04/2020      no
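For convenience, the two steps can be wrapped into a single function and checked against the expected rows. This is a minimal sketch: the function name `keep_last_except_after_yes` and its `key`/`col` parameters are hypothetical, and only the id and outcome columns are rebuilt, with a default integer index.

```python
import pandas as pd

def keep_last_except_after_yes(df, key="id", col="outcome"):
    """Two-pass filter following the steps above (names are illustrative)."""
    s = df[col]
    g = s.groupby(df[key])
    # Step 1: last row of each run, or a "no" directly following a "yes".
    is_last_of_run = ~s.eq(g.shift(-1))
    is_first_no_after_yes = g.shift().eq("yes") & s.eq("no")
    step1 = df[is_last_of_run | is_first_no_after_yes]
    # Step 2: among the survivors, keep only the first row of each run.
    s1 = step1[col]
    is_first_of_run = ~s1.eq(s1.groupby(step1[key]).shift())
    return step1[is_first_of_run]

df = pd.DataFrame({
    "id": [3] * 5 + [5] * 4 + [6] * 10,
    "outcome": ("no no no yes no "
                "no no yes yes "
                "no no yes yes yes yes yes no no no").split(),
})

result = keep_last_except_after_yes(df)
print(result.index.tolist())
```

On the sample data this selects the same index labels as the output shown above.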

edited Nov 6, 2021 at 21:44

answered Nov 6, 2021 at 21:20


2 Comments

Ze0ruso Over a year ago

Thanks for the answer, this worked perfectly. This isn't essential but just out of curiosity, say you wanted to keep a "no" just before a "yes" (basically there could be more than one "no", for example, 5 "no's" between two "yes's", where you select the first and last "no"). How would this be achieved? Would this require a significant number of masks?

mozway Over a year ago

@TSRAI it depends, if you only want to get those values it's quite easy. The trickier part is to combine everything. If you have too many conditions, it might be worth extracting rows independently on the various conditions and then joining everything back in a single dataframe.
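The "extract independently, then join back" idea could look something like the sketch below. It is only one possible reading of the follow-up: each condition selects its own rows, the pieces are concatenated, and rows picked by more than one condition are de-duplicated by index label. It assumes the index labels are unique, and it rebuilds only the id and outcome columns.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [3] * 5 + [5] * 4 + [6] * 10,
    "outcome": ("no no no yes no "
                "no no yes yes "
                "no no yes yes yes yes yes no no no").split(),
})

s = df["outcome"]
g = s.groupby(df["id"])

# Condition A: last row of each run of equal outcomes within an id.
last_of_run = df[~s.eq(g.shift(-1))]
# Condition B: a "no" that directly follows a "yes" within an id.
first_no_after_yes = df[g.shift().eq("yes") & s.eq("no")]

# Join the pieces back, drop rows selected twice, and restore the original order.
combined = pd.concat([last_of_run, first_no_after_yes])
combined = combined[~combined.index.duplicated()].sort_index()
print(combined.index.tolist())
```

Here a run of no values that follows a yes contributes both its first and its last row (rows 16 and 18 for id 6), while a single-row run such as row 4 is kept once. Further conditions can be added as extra frames in the `pd.concat` list.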
