
Here is an example of the data:

import pandas as pd
df = pd.DataFrame({
    'file': ['file1','file2','file1','file2','file3','file3','file4','file5','file4','file5'],
    'prop1': ['True','False','True','False','False','False','False','True','False','False'],
    'prop2': ['False','False','False','False','True','False','True','False','True','False'],
    'prop3': ['False','True','False','True','False','True','False','False','False','True']
})

file    prop1   prop2   prop3
0   file1   True    False   False
1   file2   False   False   True
2   file1   True    False   False
3   file2   False   False   True
4   file3   False   True    False
5   file3   False   False   True
6   file4   False   True    False
7   file5   True    False   False
8   file4   False   True    False
9   file5   False   False   True

I need to move rows with duplicated prop values into another dataframe and cut them off the original one.
So the other dataframe should look like this (duplicated rows should not repeat):

file    prop1   prop2   prop3
0   file1   True    False   False
3   file2   False   False   True
8   file4   False   True    False

df = df.drop_duplicates() only drops rows that are exact duplicates, but it keeps rows whose prop values repeat across different files, like this:

    file    prop1   prop2   prop3
0   file1   True    False   False
1   file2   False   False   True
4   file3   False   True    False
5   file3   False   False   True
6   file4   False   True    False
7   file5   True    False   False
9   file5   False   False   True
  • Have you tried drop_duplicates? Commented Oct 7, 2019 at 14:12
  • df.drop_duplicates() Commented Oct 7, 2019 at 14:13
  • try: new_df = df.loc[df.duplicated()].copy() to store duplicated values into a new dataframe Commented Oct 7, 2019 at 14:13
  • Not sure there's a simple way to get the exact indices you show in your expected output. But would suffice to do df.drop_duplicates(subset=[f'prop{i}' for i in range(1,4)]) Commented Oct 7, 2019 at 14:15
  • Yes, drop_duplicates works, but I also need to cut the duplicated rows off the dataframe Commented Oct 7, 2019 at 14:16
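Putting the comment suggestions together: df.duplicated() marks every later occurrence of a combination, so one boolean mask gives both halves of the split. A minimal sketch, restricted to the prop columns (since the file column differs between the duplicated rows):

```python
import pandas as pd

df = pd.DataFrame({
    'file': ['file1','file2','file1','file2','file3','file3','file4','file5','file4','file5'],
    'prop1': ['True','False','True','False','False','False','False','True','False','False'],
    'prop2': ['False','False','False','False','True','False','True','False','True','False'],
    'prop3': ['False','True','False','True','False','True','False','False','False','True']
})

props = ['prop1', 'prop2', 'prop3']
mask = df.duplicated(subset=props)  # True for every repeat of a prop combination
dupes = df.loc[mask].copy()         # the rows to move out
df = df.loc[~mask]                  # original frame with the repeats cut off
```

After this, df keeps only the first occurrence of each prop combination (rows 0, 1 and 4), and dupes holds everything that was cut off.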

2 Answers

uniques = df.drop_duplicates()
# Label-based selection: the index labels dropped by drop_duplicates are the duplicates
duplicates = df.loc[df.index.difference(uniques.index)]

You can first use the pandas method drop_duplicates() to create a dataframe with only the unique rows. Then compare the index of the original dataframe with the index of the deduplicated frame: the 'dropped' labels are your duplicate rows, which you can select again from the original dataframe. You now have the unique rows and the duplicated rows separated.
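Applied to the question's data, a sketch of this split (note that drop_duplicates() here compares all columns, including file, so only the exact repeats are dropped):

```python
import pandas as pd

df = pd.DataFrame({
    'file': ['file1','file2','file1','file2','file3','file3','file4','file5','file4','file5'],
    'prop1': ['True','False','True','False','False','False','False','True','False','False'],
    'prop2': ['False','False','False','False','True','False','True','False','True','False'],
    'prop3': ['False','True','False','True','False','True','False','False','False','True']
})

uniques = df.drop_duplicates()
# Index.difference gives the labels that drop_duplicates removed
duplicates = df.loc[df.index.difference(uniques.index)]
```

Here rows 2, 3 and 8 are exact duplicates of earlier rows, so they end up in duplicates while the other seven rows stay in uniques.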




Use DataFrame.drop_duplicates and specify the column names by selecting all columns except the first:

df = df.drop_duplicates(df.columns[1:])

Or select the columns with prop in their names:

df = df.drop_duplicates(df.filter(like='prop').columns)

print (df)
    file  prop1  prop2  prop3
0  file1   True  False  False
1  file2  False  False   True
4  file3  False   True  False
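The expected output in the question additionally wants only one representative row per duplicated prop combination. Building on the same subset idea, a sketch that first keeps only the repeats and then deduplicates those (note the surviving indices are the first repeats, not the exact 0/3/8 shown in the question):

```python
import pandas as pd

df = pd.DataFrame({
    'file': ['file1','file2','file1','file2','file3','file3','file4','file5','file4','file5'],
    'prop1': ['True','False','True','False','False','False','False','True','False','False'],
    'prop2': ['False','False','False','False','True','False','True','False','True','False'],
    'prop3': ['False','True','False','True','False','True','False','False','False','True']
})

props = [c for c in df.columns if c.startswith('prop')]
repeats = df[df.duplicated(subset=props)]              # every later occurrence
one_per_group = repeats.drop_duplicates(subset=props)  # one row per duplicated combo
```

one_per_group then contains one row for each prop combination that occurred more than once, matching the combinations in the question's expected output.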

