
I have the following dataframe:

    id outcome
0    3      no
1    3      no
2    3      no
3    3     yes
4    3      no
5    5      no
6    5      no
7    5     yes
8    5      no
9    5     yes
10   6      no
11   6      no
12   6     yes
13   6      no
14   6      no

I want to remove the no outcomes at the start of a sequence before a yes, and keep all other no outcomes, so the output dataframe looks like this:

    id outcome
3    3     yes
4    3      no
7    5     yes
8    5      no
9    5     yes
12   6     yes
13   6      no
14   6      no

At the moment I have tried this:

import pandas as pd

df = pd.DataFrame(data={
       'id': [3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6],
       'outcome': ['no', 'no', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'no', 'no']
     })


df = df[df.groupby('id').outcome.transform(lambda x: x.ne('no'))]

However, this simply removes all no outcomes.

I know I then need to take the index of these rows and remove them from the dataframe. Any suggestions?

4 Comments
  • All the 'no' outcomes before a 'yes', at the start of the sequence per 'id'. I also want to know how many of these 'no' outcomes before a 'yes' at the start of a sequence exist, hence the counting task. Any 'no' after a 'yes' must stay. Commented Nov 17, 2021 at 1:13
  • @TSRAI: In your wanted outcome, there is no indication of the number of 'no' that was removed. Commented Nov 17, 2021 at 1:20
  • Updated the question :) Commented Nov 17, 2021 at 1:23
  • I found a working answer to your question about keeping the no's at the end of each group. For the question regarding the count of no's at the beginning, I think you should ask a new question for that, because it's a different problem that has to be solved differently. Commented Nov 17, 2021 at 1:53

2 Answers


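Use groupby with cumsum to count the 'yes' values seen so far within each id, so that every 'no' at the start of a group is marked with a 0. (transform is used rather than apply so the result keeps the original index on recent pandas versions.)

df['no_group'] = df.groupby('id')['outcome'].transform(lambda x: x.eq('yes').cumsum())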

Now, the number of 'no's to remove is:

num_no_to_remove = (df['no_group'] == 0).sum()

And the wanted dataframe can be obtained by filtering:

df.loc[df['no_group'] > 0].drop(columns=['no_group'])

Result:

    id  outcome
3    3      yes
4    3       no
7    5      yes
8    5       no
9    5      yes
12   6      yes
13   6       no
14   6       no
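
For reference, here is a minimal end-to-end sketch of the same cumsum idea without the helper column or the lambda. It assumes the 15-row frame shown in the question's table; the names `yes_so_far`, `result` and `num_removed` are only illustrative.

```python
import pandas as pd

# Sample data copied from the table in the question (15 rows).
df = pd.DataFrame({
    'id': [3] * 5 + [5] * 5 + [6] * 5,
    'outcome': ['no', 'no', 'no', 'yes', 'no',
                'no', 'no', 'yes', 'no', 'yes',
                'no', 'no', 'yes', 'no', 'no'],
})

# Running count of 'yes' values within each id.
# Rows before the first 'yes' of a group have a count of 0.
yes_so_far = df['outcome'].eq('yes').groupby(df['id']).cumsum()

# Keep every row from the first 'yes' of its group onward.
result = df[yes_so_far.gt(0)]

# Number of leading 'no' rows that were dropped, over all ids.
num_removed = int(yes_so_far.eq(0).sum())

print(result)
print(num_removed)
```

Note that an id with no 'yes' at all never gets a count above 0, so all of its rows would be dropped by this filter.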

1 Comment

Thanks for attempting it too @user17242583

To keep only the last no values of each group and all the yes values, this code will do the trick:

import numpy as np
df = df[(df.replace({'no': np.nan, 'yes': 1}).groupby('id')['outcome'].bfill() != 1) | (df['outcome'] == 'yes')]

Output:

>>> df
    id outcome
3    3     yes
4    3      no
5    3      no
8    5     yes
9    5     yes
12   6     yes

(In the original df, I added a second no at the end of group 3 to make sure it works for multiple no's at the end.)

In essence, the code:

  1. Replaces yes values with an arbitrary value (1 in this case)
  2. Replaces no values with NaN (which is important!)
  3. Groups the rows by their ID
  4. For each group, back-fills every NaN row that comes before a non-NaN row with that row's value. Since the yes's are 1 and the no's are NaN, everything except the last no's of the group ends up as the arbitrary number (1)
  5. Creates a mask which selects all those last no values of each group
  6. Creates a second mask which selects all yes values
  7. Combines the two masks to return all yes values and all no values that are at the end of a group.
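
The steps above can be sketched one at a time as follows. This is only an illustration: it uses the 15-row frame from the question's table rather than my modified one, and it swaps `DataFrame.replace` for `Series.map` on the outcome column, which should give the same mask. The intermediate names (`marked`, `filled`, `trailing_no`, `is_yes`) are made up for the example.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question (15 rows).
df = pd.DataFrame({
    'id': [3] * 5 + [5] * 5 + [6] * 5,
    'outcome': ['no', 'no', 'no', 'yes', 'no',
                'no', 'no', 'yes', 'no', 'yes',
                'no', 'no', 'yes', 'no', 'no'],
})

# Steps 1-2: 'yes' becomes 1.0 and 'no' becomes NaN.
marked = df['outcome'].map({'yes': 1.0, 'no': np.nan})

# Steps 3-4: within each id, fill each NaN from the next non-NaN value.
# Only the 'no' rows after the last 'yes' of a group stay NaN.
filled = marked.groupby(df['id']).bfill()

# Step 5: mask for the trailing 'no' rows of each group.
trailing_no = filled.isna()

# Step 6: mask for all 'yes' rows.
is_yes = df['outcome'].eq('yes')

# Step 7: combine the two masks.
result = df[trailing_no | is_yes]
print(result)
```

On this data the 'no' at index 8 sits between two 'yes' rows in group 5, so it gets back-filled and dropped. That is the difference from removing only the 'no' rows at the start of each group.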

For the question regarding the count of no's at the beginning, I think you should ask a new question for that, because it's a different problem that has to be solved differently.
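
That said, if a per-id count of the leading no's is all that is needed, one possible sketch is a running count of 'yes' values per group, then counting the rows where it is still 0. This assumes the 15-row frame from the question's table, and `leading_no_counts` is just an illustrative name.

```python
import pandas as pd

# Sample data copied from the table in the question (15 rows).
df = pd.DataFrame({
    'id': [3] * 5 + [5] * 5 + [6] * 5,
    'outcome': ['no', 'no', 'no', 'yes', 'no',
                'no', 'no', 'yes', 'no', 'yes',
                'no', 'no', 'yes', 'no', 'no'],
})

# Running count of 'yes' values within each id.
yes_so_far = df['outcome'].eq('yes').groupby(df['id']).cumsum()

# A row is a leading 'no' exactly when no 'yes' has been seen yet in its group.
# Summing that boolean per id gives the count of leading 'no' rows.
leading_no_counts = yes_so_far.eq(0).groupby(df['id']).sum()
print(leading_no_counts)
```

An id that never has a 'yes' would have all of its rows counted as leading no's, so that case may need separate handling.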

3 Comments

Hi, thanks for the answer. Apologies, I should have made this clearer. I have updated the question to reflect this. I also want to keep any 'no' outcomes in between 'yes' outcomes, basically only removing the 'no' values at the start of a sequence (per id).
Okay, that shouldn't be too hard. Let me see...
So you want to remove no's at the beginning of each group? But your sample data only has no's at the beginning or end or both of each group, but not in between any yes's in each group.
