How can I delete a sequence of rows based on a condition?

Question

I have the following dataframe:

    id outcome
0    3      no
1    3      no
2    3      no
3    3     yes
4    3      no
5    5      no
6    5      no
7    5     yes
8    5      no
9    5     yes
10   6      no
11   6      no
12   6     yes
13   6      no
14   6      no

For each id, I want to remove the no outcomes at the start of the sequence, before the first yes, and keep all other no outcomes, so the output dataframe looks like this:

    id outcome
3    3     yes
4    3      no
7    5     yes
8    5      no
9    5     yes
12   6     yes
13   6      no
14   6      no

At the moment I have tried this:

import pandas as pd

df = pd.DataFrame(data={
       'id': [3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6],
       'outcome': ['no', 'no', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'no', 'no']
     })


df = df[df.groupby('id').outcome.transform(lambda x: x.ne('no'))]

However, this simply removes all no outcomes.

I know I then need to take the index of these rows and remove them from the dataframe. Any suggestions?

All the 'no' outcomes before a 'yes', at the start of the sequence per 'id'. I also want to know how many of these 'no' before 'yes' at the start of a sequence exist, hence the counting task. Any 'no' after a 'yes' must stay.
– Ze0ruso, Commented Nov 17, 2021 at 1:13
@TSRAI: In your wanted outcome, there is no indication of the number of 'no' that was removed.
– Shaido, Commented Nov 17, 2021 at 1:20
I found a working answer to your question about keeping the no's at the end of each group. For the question regarding the count of no's at the beginning, I think you should ask a new question for that, because it's a different problem that has to be solved differently.
– user17242583, Commented Nov 17, 2021 at 1:53

Shaido · Accepted Answer · 2021-11-17 02:07:59Z

2

Use groupby with cumsum to mark all 'no' at the start with a 0:

df['no_group'] = df.groupby('id')['outcome'].transform(lambda x: x.eq('yes').cumsum())

Now, the number of 'no's to remove is:

num_no_to_remove = (df['no_group'] == 0).sum()
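Since the comments ask how many 'no' rows sit before the first 'yes', a per-id breakdown may be more useful than a single total. The following is a minimal sketch, assuming the 15-row sample from the question; the names `no_group` and `leading_no_per_id` are illustrative only.

```python
import pandas as pd

# Sample data matching the table shown in the question.
df = pd.DataFrame({
    'id': [3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6],
    'outcome': ['no', 'no', 'no', 'yes', 'no',
                'no', 'no', 'yes', 'no', 'yes',
                'no', 'no', 'yes', 'no', 'no'],
})

# Running count of 'yes' within each id; 0 means "before the first yes".
no_group = df['outcome'].eq('yes').astype(int).groupby(df['id']).cumsum()

# Count the rows still at 0 for each id (hypothetical variable name).
leading_no_per_id = no_group.eq(0).astype(int).groupby(df['id']).sum()
print(leading_no_per_id)
```

Note that an id with no 'yes' at all stays at 0 for every row, so all of its rows would be counted (and later removed) under this approach.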

And the wanted dataframe can be obtained by filtering:

df.loc[df['no_group'] > 0].drop(columns=['no_group'])

Result:

    id  outcome
3    3      yes
4    3       no
7    5      yes
8    5       no
9    5      yes
12   6      yes
13   6       no
14   6       no
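If the helper column is not needed afterwards, the same idea can be written as a boolean mask. This is only a sketch under the assumption that the data matches the question's table; `drop_leading_no` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd


def drop_leading_no(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop rows that come before the first 'yes' within each id."""
    is_yes = frame['outcome'].eq('yes').astype(int)
    # True from the first 'yes' of each id onward, False before it.
    seen_yes = is_yes.groupby(frame['id']).cumsum().gt(0)
    return frame[seen_yes]


df = pd.DataFrame({
    'id': [3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6],
    'outcome': ['no', 'no', 'no', 'yes', 'no',
                'no', 'no', 'yes', 'no', 'yes',
                'no', 'no', 'yes', 'no', 'no'],
})

result = drop_leading_no(df)
print(result)
```

Because the mask is built from the grouped cumulative sum, the original index is preserved and no temporary column has to be dropped.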

answered Nov 17, 2021 at 2:07

Shaido


1 Comment

Ze0ruso Over a year ago

Thanks for attempting it too @user17242583

user17242583 · Answer · 2021-11-17 01:59:07Z

1

To keep only the trailing no values of each group, plus all the yes values, this code will do the trick (it assumes numpy is imported as np):

df = df[(df.replace({'no': np.nan, 'yes': 1}).groupby('id')['outcome'].bfill() != 1) | (df['outcome'] == 'yes')]

Output:

>>> df
    id outcome
3    3     yes
4    3      no
5    3      no
8    5     yes
9    5     yes
12   6     yes

(In the original df, I added a second no to the end of group 3 to make sure it works for multiple no's at the end.)

Essentially, the code does the following:

Replaces yes values with an arbitrary value (1 in this case)
Replaces no values with NaN (which is important!)
Groups the rows by their ID
For each group, back-fills every NaN row that comes before a non-NaN row with the value of the next non-NaN row. Since the yes's are 1 and the no's are NaN, every no except the trailing no's of the group is replaced with the arbitrary number (1)
Creates a mask which selects all those trailing no values of each group
Creates a second mask which selects all yes values
Combines those two masks to return all yes values and all no values that are at the end of a group.
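The steps above can be sketched one at a time, which makes the intermediate values easier to inspect. This is an illustrative breakdown rather than the exact one-liner: it uses `Series.map` instead of `DataFrame.replace`, and the sample data and variable names are assumptions chosen to show the behaviour.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: id 3 ends with two 'no' rows,
# id 5 has a 'no' sitting between two 'yes' rows.
df = pd.DataFrame({
    'id': [3, 3, 3, 3, 3, 3, 5, 5, 5, 5],
    'outcome': ['no', 'no', 'no', 'yes', 'no', 'no',
                'no', 'yes', 'no', 'yes'],
})

# 'yes' becomes the arbitrary value 1.0, 'no' becomes NaN.
marked = df['outcome'].map({'yes': 1.0, 'no': np.nan})

# Back-fill within each id: a NaN takes the value of the next non-NaN row.
filled = marked.groupby(df['id']).bfill()

# Rows still NaN have no 'yes' after them, so they are trailing 'no' rows.
trailing_no = filled.isna()
is_yes = df['outcome'].eq('yes')

result = df[trailing_no | is_yes]
print(result)
```

In this sample the 'no' at index 8 is dropped because a 'yes' follows it, which shows that the approach keeps only trailing 'no' rows and not those between two 'yes' rows.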

For the question regarding the count of no's at the beginning, I think you should ask a new question for that, because it's a different problem that has to be solved differently.

edited Nov 17, 2021 at 1:59

answered Nov 17, 2021 at 1:52

user17242583

3 Comments

Ze0ruso Over a year ago

Hi, thanks for the answer. Apologies, I should have made this clearer. I have updated the question to reflect this. I also want to keep any 'no' outcomes in between 'yes' outcomes, basically removing only the 'no' values at the start of a sequence (per id).

user17242583 Over a year ago

Okay, that shouldn't be too hard. Let me see...

user17242583 Over a year ago

So you want to remove no's at the beginning of each group? But your sample data only has no's at the beginning or end or both of each group, but not in between any yes's in each group.
