I currently have this dataframe:

id  date       outcome
3   03/05/2019  no
3   29/05/2019  no
3   04/09/2019  no
3   30/10/2019  yes
3   03/05/2020  no
5   03/12/2019  no
5   26/12/2019  no
5   27/01/2020  yes
5   03/06/2020  yes
6   04/05/2019  no
6   27/10/2019  no
6   26/11/2019  yes
6   28/11/2019  yes
6   29/11/2019  yes
6   13/04/2020  yes
6   14/04/2020  yes
6   24/04/2020  no
6   30/04/2020  no
6   05/05/2020  no

The rows are grouped by id and, within each id, sorted by date in ascending order.

I want to remove a row if the row after it has the same outcome. HOWEVER, if a row's outcome is yes, then the next row kept must be the FIRST no that follows it. This is the desired outcome for the above dataframe:

id  date       outcome
3   04/09/2019  no
3   30/10/2019  yes
3   03/05/2020  no
5   26/12/2019  no
5   03/06/2020  yes
6   27/10/2019  no
6   14/04/2020  yes
6   24/04/2020  no

At the moment I am doing this:

m1 = (df['outcome'] != df['outcome'].shift()).cumsum()
updated_df = df.groupby([df['id'],m1]).tail(1)

However, this only gives me the last value (yes/no) of a grouped yes/no count. How can I apply a condition in the most pandas way possible?

  • I don't understand the condition. For id=3 the row after the 'yes' (i.e. the row with date 03/05/2020) is not the first row of the group with outcome 'no', but it's still present in the expected output. Commented Nov 6, 2021 at 21:29
  • @HarryPlotter it's a bit tricky, I got it wrong at first. Basically it's dropping the consecutive duplicates, keeping the last, except after a yes, keeping the first. Everything, per group. Commented Nov 6, 2021 at 21:41

1 Answer

IIUC, you need two steps. First, compute a mask that keeps a row when its outcome differs from the next one (i.e. the last row of each run of identical outcomes) OR when it is a no that directly follows a yes, with everything computed per group. This gives the filtering you want, except after a yes, where a run of no's can leave two rows: the "after-yes" row to keep and the "last" row to discard.

Second, compare consecutive outcomes again on the filtered rows, but keep the first of each run this time.

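# step 1: keep the last row of each run, plus the first "no" after a "yes"
m1 = df['outcome']
m2 = m1.groupby(df['id']).shift(-1)   # next outcome within the same id
m3 = m1.groupby(df['id']).shift().eq('yes') & m1.eq('no')   # "no" right after a "yes"

df2 = df[~m1.eq(m2) | m3]

# step 2: among the remaining consecutive duplicates, keep the first
m4 = df2['outcome']
m5 = m4.groupby(df2['id']).shift()    # previous outcome within the same id
df2[~m4.eq(m5)]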

Output:

    id        date outcome
2    3  04/09/2019      no
3    3  30/10/2019     yes
4    3  03/05/2020      no
6    5  26/12/2019      no
8    5  03/06/2020     yes
10   6  27/10/2019      no
15   6  14/04/2020     yes
16   6  24/04/2020      no
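
For anyone who wants to reproduce this end to end, here is a self-contained sketch of the same two steps, wrapped in a helper. The function name `filter_outcomes` and the intermediate variable names are made up for illustration, and the sample frame is rebuilt by hand from the question (dates kept as plain strings, default integer index assumed).

```python
import pandas as pd

def filter_outcomes(df):
    """Two-step filter from the answer above (hypothetical helper name)."""
    outcome = df["outcome"]
    nxt = outcome.groupby(df["id"]).shift(-1)   # next outcome within the same id
    prev = outcome.groupby(df["id"]).shift()    # previous outcome within the same id

    # Step 1: last row of each run, OR a "no" that directly follows a "yes".
    last_of_run = ~outcome.eq(nxt)
    no_after_yes = prev.eq("yes") & outcome.eq("no")
    step1 = df[last_of_run | no_after_yes]

    # Step 2: drop rows repeating the previous kept outcome (keep the first).
    prev_kept = step1["outcome"].groupby(step1["id"]).shift()
    return step1[~step1["outcome"].eq(prev_kept)]

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [3] * 5 + [5] * 4 + [6] * 10,
    "date": [
        "03/05/2019", "29/05/2019", "04/09/2019", "30/10/2019", "03/05/2020",
        "03/12/2019", "26/12/2019", "27/01/2020", "03/06/2020",
        "04/05/2019", "27/10/2019", "26/11/2019", "28/11/2019", "29/11/2019",
        "13/04/2020", "14/04/2020", "24/04/2020", "30/04/2020", "05/05/2020",
    ],
    "outcome": [
        "no", "no", "no", "yes", "no",
        "no", "no", "yes", "yes",
        "no", "no", "yes", "yes", "yes", "yes", "yes", "no", "no", "no",
    ],
})

result = filter_outcomes(df)
print(result)
```

The helper relies on the frame already being sorted by date within each id, as stated in the question.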
2 Comments

Thanks for the answer, this worked perfectly. This isn't essential but just out of curiosity, say you wanted to keep a "no" just before a "yes" (basically there could be more than one "no", for example, 5 "no's" between two "yes's", where you select the first and last "no"). How would this be achieved? Would this require a significant number of masks?
@TSRAI it depends, if you only want to get those values it's quite easy. The trickier part is to combine everything. If you have too many conditions, it might be worth extracting rows independently on the various conditions and then joining everything back in a single dataframe.
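
A rough sketch of the "extract independently, then join back" idea from the comment above, under one possible reading of the follow-up: keep the last row of every run, and additionally the first no after a yes, so a run of no's between two yes's keeps both its first and last row. The toy data and all variable names here are assumptions for illustration, not from the original question.

```python
import pandas as pd

# Toy data (made up): a run of three "no" rows sits between two "yes" rows.
df = pd.DataFrame({
    "id": [1] * 8,
    "outcome": ["no", "no", "yes", "no", "no", "no", "yes", "no"],
})

outcome = df["outcome"]
by_id = outcome.groupby(df["id"])
prev, nxt = by_id.shift(), by_id.shift(-1)

# Condition A: the last row of each run of identical outcomes.
last_of_run = df[~outcome.eq(nxt)]

# Condition B: a "no" that directly follows a "yes".
first_no_after_yes = df[prev.eq("yes") & outcome.eq("no")]

# Join the pieces back, dropping rows selected by both conditions.
# This assumes the original index is unique.
combined = pd.concat([last_of_run, first_no_after_yes])
combined = combined[~combined.index.duplicated()].sort_index()
print(combined)
```

With only two conditions, OR-ing the boolean masks directly would give the same rows; the concat form mainly pays off when each condition is easier to express as its own selection.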
