
I have a dataframe similar to the following example:

import pandas as pd
data = pd.DataFrame(data={'col1': [1,2,3,4,5,6,7,8,9], 'col2': [1.55,1.55,1.55,1.8,1.9,1.9,1.9,2.1,2.1]})

In the second column, col2, there are several duplicate values: 1.55 three times, 1.9 three times, and 2.1 twice. I need to remove every row whose col2 value is a duplicate of the previous row's value, so the first row of each run is the one I'd like to keep. In this example, that leaves the rows where col1 is 1, 4, 5, and 8, giving the following dataframe as my desired output:

clean_data = pd.DataFrame(data={'col1': [1,4,5,8], 'col2': [1.55,1.8,1.9,2.1]})

What is the best way to go about this for a dataframe which is much larger (in terms of rows) than this small example?

4 Comments

  • Do you want to remove rows that are a duplicate of just the immediately previous row, or rows that are a duplicate of any of the previous rows?
  • Only of the immediate previous row, not of all previous rows. Sorry for the unclear description.
  • Rereading your question, I think your intent is clear; my mistake.
  • For posterity: if you want to remove rows where the col2 entry is a duplicate of any of the preceding values, you can do clean_data = data.loc[~data['col2'].duplicated(),:]
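To make the distinction in the comments concrete, here is a small sketch (with made-up data where a value reappears after a gap) showing that `duplicated` drops repeats of *any* earlier value, not just of the immediately preceding row:

```python
import pandas as pd

# Illustrative data: 1.55 reappears after a different value (1.8).
df = pd.DataFrame({'col2': [1.55, 1.55, 1.8, 1.55]})

# duplicated() marks rows whose value already appeared ANYWHERE above,
# so the 1.55 at index 3 is dropped even though it starts a new run.
clean = df.loc[~df['col2'].duplicated(), :]
print(clean)  # keeps only index 0 (1.55) and index 2 (1.8)
```

With the question's requirement (drop duplicates of the immediate predecessor only), index 3 should be kept, which is why `duplicated` alone does not answer the question as asked.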

1 Answer


You can use shift: comparing each col2 value with the value in the row above keeps only the first row of each consecutive run (the very first row survives because NaN compares unequal to everything):

data.loc[data['col2'] != data['col2'].shift(1)]
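Applied to the question's example, this is a minimal runnable version of the approach:

```python
import pandas as pd

data = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                     'col2': [1.55, 1.55, 1.55, 1.8, 1.9, 1.9, 1.9, 2.1, 2.1]})

# shift(1) moves col2 down one row, so each value is compared with its
# immediate predecessor; only the first row of each consecutive run passes.
clean_data = data.loc[data['col2'] != data['col2'].shift(1)]
print(clean_data)
# keeps the rows where col1 is 1, 4, 5, 8
```

Since this is a single vectorized comparison, it scales well to dataframes that are much larger than the example.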
