Pandas remove all elements from one dataframe from another

Question

I want to find the difference between two dataframes (elements in df1, not in df2) based on a subset of columns. The two data frames have the same schema.

Say df1 contains

col1 col2 col3 col4
A    B    C    D
A    C    D    D

and df2 contains

col1 col2 col3 col4
A    D    D    D
A    B    D    D

I want the rows of df1 for which no row in df2 has the same col1 and col2 values. In this case the expected output is just the 2nd row of df1:

A    C    D    D

I've tried different variations of isin, but I'm struggling to find anything that works. I tried https://stackoverflow.com/a/16704977/1639228 , but that only works for single columns.

Why do you say 'based on col1 and col2'? Your expected output seems more like the second row of df1.
– logc, Commented Apr 10, 2014 at 18:39
The expected output is the 2nd row of df1. I mean that I want items in df1, but not df2, looking only at columns col1 and col2.
– Manny, Commented Apr 10, 2014 at 18:43

Rutger Kassies · Accepted Answer · 2014-04-11 07:38:40Z

2

The problem with using isin is that the index also has to match if you pass it a DataFrame. I don't know what your index is, but if it differs on the rows where col1 and col2 are equal, isin will still report no match.

Converting your second DataFrame to a dict of lists makes it work, since that drops the index. isin then matches each column separately, and all(axis=1) narrows this down to the rows where both columns match.

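sub = ['col1', 'col2']
# orient='list' replaces the old outtype='list' argument in current pandas
mask = df1[sub].isin(df2[sub].to_dict(orient='list')).all(axis=1)

# keep the rows of df1 that were not matched
df1[~mask]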

  col1 col2 col3 col4
1    A    C    D    D
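
Because isin with a dict checks each column on its own, a row of df1 can be flagged when its col1 value appears somewhere in df2.col1 and its col2 value appears somewhere in df2.col2, even if no single row of df2 holds that pair. A possible alternative that compares whole (col1, col2) pairs is a left merge with indicator=True. This is only a sketch, not part of the original answer: the sample frames are rebuilt from the question, and the names merged and keep are made up for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question
df1 = pd.DataFrame({"col1": ["A", "A"], "col2": ["B", "C"],
                    "col3": ["C", "D"], "col4": ["D", "D"]})
df2 = pd.DataFrame({"col1": ["A", "A"], "col2": ["D", "B"],
                    "col3": ["D", "D"], "col4": ["D", "D"]})

sub = ["col1", "col2"]

# Left-join df1 against the distinct key pairs of df2. The "_merge" column
# added by indicator=True says whether each df1 row found a matching pair.
merged = df1.merge(df2[sub].drop_duplicates(), on=sub, how="left", indicator=True)

# The right side has unique keys, so merged has one row per df1 row in df1's
# order; a positional boolean array can therefore be applied to df1 directly.
keep = (merged["_merge"] == "left_only").to_numpy()
result = df1[keep]
print(result)
```

Applying the mask to df1 rather than returning merged keeps df1's original index and avoids carrying the helper _merge column along.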


1 Comment

Manny Over a year ago

Thanks for the response, +1. I posted a solution I came up with. Do you know if either of these is more efficient? I'm not familiar with what pandas is actually doing under the hood.

Raja · Accepted Answer · 2016-12-02 03:55:10Z

1

I know this is a very old question, but it comes up at the top of Google when I search for this problem. If there is a column in both dataframes whose values are unique, it can be done like this:

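  # values of the unique column that are present in df2
  uniq_value_list = df2['col1'].tolist()
  # keep only the rows of df1 whose col1 value does not occur in df2
  df3 = df1[~df1.col1.isin(uniq_value_list)]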

Now, the third dataframe will have values that are in df1 but not df2.


Manny · Accepted Answer · 2014-04-11 16:51:45Z

0

I don't know if this is efficient, but I found a way to do it after hours of experimenting. It involves first re-indexing the dataframes to use the columns you care about as the index.

df1.set_index(['col1', 'col2'], inplace=True)
df2.set_index(['col1', 'col2'], inplace=True)

df1[df1.index.map(lambda x: x not in df2.index)]
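
The same idea can probably be written without changing df1 and df2 in place and without a Python-level lambda. The sketch below assumes pandas 0.24 or later (for MultiIndex.from_frame) and rebuilds the sample frames from the question; keys, idx1, and idx2 are illustrative names that do not appear in the answer above.

```python
import pandas as pd

# Sample data rebuilt from the question
df1 = pd.DataFrame({"col1": ["A", "A"], "col2": ["B", "C"],
                    "col3": ["C", "D"], "col4": ["D", "D"]})
df2 = pd.DataFrame({"col1": ["A", "A"], "col2": ["D", "B"],
                    "col3": ["D", "D"], "col4": ["D", "D"]})

keys = ["col1", "col2"]

# Build a MultiIndex from the key columns of each frame; df1 and df2
# themselves keep their original index and columns.
idx1 = pd.MultiIndex.from_frame(df1[keys])
idx2 = pd.MultiIndex.from_frame(df2[keys])

# MultiIndex.isin compares whole (col1, col2) tuples and returns a
# boolean NumPy array aligned with the rows of df1.
mask = idx1.isin(idx2)
result = df1[~mask]
print(result)
```

Since nothing is re-indexed in place, col1 and col2 remain ordinary columns in the result.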
