Pandas - Remove duplicates from two dataframes with different columns

Question

I have two dataframes: a major one (df) and one containing the rows that I want to delete from the major one (dfmatch). The major df has more columns than dfmatch.

I only want to delete a row from the major df if column1, column2 AND column3 all equal the values in the corresponding columns of a row in dfmatch.

Columns extra1 and extra2 should be available in dfnew as well.

My current script only shows the column headers instead of the remaining rows:

import pandas as pd

file = 'testdf.csv'
colnames=['column1', 'column2', 'column3', 'extra1', 'extra2'] 
df = pd.read_csv(file, names=colnames, header=None)

file = 'testdfmatch.csv'
colnames=['column1', 'column2', 'column3'] 
dfmatch = pd.read_csv(file, names=colnames, header=None)

dfnew = pd.concat([dfmatch,df,df], sort=False).drop_duplicates(['column1', 'column2', 'column3'], keep=False)

wwnde · Accepted Answer · 2020-09-14 20:50:53Z

2

Sample data would have been useful. Let's try pd.merge with the indicator= parameter:

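# Left-merge on the shared key columns; 'Exist' records where each row was found
dfnew = pd.merge(df, dfmatch, how='left', on=['column1', 'column2', 'column3'], indicator='Exist')
# Keep only the rows of df that have no match in dfmatch
dfnew = dfnew.loc[dfnew['Exist'] != 'both']
dfnew = dfnew.drop(columns=['Exist'])
print(dfnew)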
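Since no sample data was posted, here is a minimal sketch of the same merge-with-indicator idea on made-up frames. All values are invented for illustration; only the column names come from the question. It assumes the default indicator column name `_merge`, and it de-duplicates dfmatch first so that a repeated key there cannot multiply rows during the merge.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    'column1': [1, 2, 3, 4],
    'column2': ['a', 'b', 'c', 'd'],
    'column3': [10, 20, 30, 40],
    'extra1': ['w', 'x', 'y', 'z'],
    'extra2': [0.1, 0.2, 0.3, 0.4],
})
dfmatch = pd.DataFrame({
    'column1': [2, 4, 9],
    'column2': ['b', 'd', 'q'],
    'column3': [20, 40, 90],
})

keys = ['column1', 'column2', 'column3']

# A left merge keeps every row of df; '_merge' says whether a match was found.
merged = df.merge(dfmatch.drop_duplicates(keys), how='left', on=keys, indicator=True)

# 'left_only' marks rows of df with no match in dfmatch (an anti-join).
dfnew = merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')
dfnew = dfnew.reset_index(drop=True)
print(dfnew)
```

Only the rows of df whose three key columns do not all appear together in dfmatch remain, and extra1 and extra2 are carried along unchanged.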

answered Sep 14, 2020 at 20:50


NikosVavlas · Accepted Answer · 2020-09-15 11:32:06Z

0

The code below does what you want.

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
dfnew = pd.concat([df, dfmatch], ignore_index=True)
dfnew.drop_duplicates(subset=['column1', 'column2', 'column3'],
                      keep='first', inplace=True)

It appends dfmatch below df to create dfnew, then removes duplicate rows using only column1, column2 and column3 as the subset. It keeps only the first occurrence, which corresponds to the original rows from df, including extra1 and extra2.

I wouldn't suggest using float columns as the subset, though, because floating-point precision in Python can make values that look equal compare as different. Rows with NaN in extra1 and extra2 are those that originally came from dfmatch.
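To illustrate the float caveat, here is a small sketch with invented values. It shows two keys that print almost identically but are not equal, so drop_duplicates does not treat them as duplicates. Rounding the key column first is one possible workaround; the 9 decimal places used here are an arbitrary assumption, not a recommendation from the answer.

```python
import pandas as pd

# Hypothetical frames: 0.1 + 0.2 is stored as 0.30000000000000004, not 0.3.
a = pd.DataFrame({'column1': [0.1 + 0.2], 'extra1': ['x']})
b = pd.DataFrame({'column1': [0.3]})

stacked = pd.concat([a, b], ignore_index=True)

# Exact comparison: the two keys differ, so neither row is seen as a duplicate.
exact = stacked.drop_duplicates(subset=['column1'], keep=False)

# Rounding the key first (9 decimals is an assumed tolerance) makes them match.
rounded_keys = stacked.assign(column1=stacked['column1'].round(9))
rounded = rounded_keys.drop_duplicates(subset=['column1'], keep=False)

print(len(exact), len(rounded))
```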

edited Sep 15, 2020 at 11:32

answered Sep 15, 2020 at 11:21


4 Comments

Scripter Over a year ago

The "duplicate" rows should be removed, instead of keeping the first occurrence.

NikosVavlas Over a year ago

@Scripter Then you can set keep=False to remove all duplicates.

Scripter Over a year ago

Thanks. Does this answer have memory advantages compared to the other answer?

NikosVavlas Over a year ago

@Scripter I am not sure about that. I'll have to try the previously suggested solutions myself to see if it does.
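The difference between keep='first' and keep=False discussed in these comments can be sketched on made-up data as follows. The values are invented; only the column names come from the question. The last step, concatenating dfmatch twice, is a common variation and not something either answer proposed.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    'column1': [1, 2, 3, 4],
    'column2': ['a', 'b', 'c', 'd'],
    'column3': [10, 20, 30, 40],
    'extra1': ['w', 'x', 'y', 'z'],
    'extra2': [0.1, 0.2, 0.3, 0.4],
})
dfmatch = pd.DataFrame({
    'column1': [2, 4, 9],
    'column2': ['b', 'd', 'q'],
    'column3': [20, 40, 90],
})
keys = ['column1', 'column2', 'column3']

# Stack df on top of dfmatch; the dfmatch rows get NaN in extra1/extra2.
stacked = pd.concat([df, dfmatch], ignore_index=True)

# keep='first' keeps the df copy of each matched row, so df loses nothing.
kept_first = stacked.drop_duplicates(subset=keys, keep='first')

# keep=False drops every row whose key occurs more than once (the matches),
# but dfmatch rows that matched nothing in df are still present.
kept_none = stacked.drop_duplicates(subset=keys, keep=False)

# Including dfmatch twice makes each of its rows a duplicate, so all are dropped.
dfnew = pd.concat([df, dfmatch, dfmatch]).drop_duplicates(subset=keys, keep=False)
print(dfnew)
```

The script in the question concatenates df twice rather than dfmatch, which makes every row of df a duplicate of itself, so keep=False removes all of them and only the headers remain. Note that any keep=False variant also drops rows whose keys are repeated within df itself.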
