1

I want to find the difference between two dataframes (elements in df1, not in df2) based on a subset of columns. The two data frames have the same schema.

Say df1 contains

col1 col2 col3 col4
A    B    C    D
A    C    D    D

and df2 contains

col1 col2 col3 col4
A    D    D    D
A    B    D    D

I want the rows of df1 for which no row in df2 has the same col1 and col2 values. In this case the expected output would be just the 2nd row of df1.

A    C    D    D

I've tried different variations of isin, but I'm struggling to find anything that works. I tried https://stackoverflow.com/a/16704977/1639228 , but that only works for single columns.

2
  • Why do you say 'based on col1 and col2'? Your expected output seems more like the second row of df1 Commented Apr 10, 2014 at 18:39
  • The expected output is the 2nd row of df1. I mean that I want items in df1, but not df2, looking only at columns col1 and col2. Commented Apr 10, 2014 at 18:43

3 Answers

2

The problem with using isin is that the index also has to match if you pass it a DataFrame. I don't know what your index is, but if it differs on the rows where col1 and col2 are equal, isin will still return a negative result.

Converting your second DataFrame to a dict of lists will make it work (since that removes the index). isin then matches each column separately, and with all(axis=1) you filter this down to the rows where both columns match.

sub = ['col1', 'col2']
mask = df1[sub].isin(df2[sub].to_dict(orient='list')).all(axis=1)

df1[~mask]

  col1 col2 col3 col4
1    A    C    D    D
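One caveat with the approach above: because isin with a dict checks each column against its own list of values, a row can be flagged as matching even when its (col1, col2) pair never occurs together in a single row of df2. If that matters for your data, a left merge with indicator=True compares the pairs as a unit. The following is only a sketch under that assumption, not part of the original answer; the frames are rebuilt from the question's data, and the names merged and result are illustrative.

```python
import pandas as pd

# Data from the question.
df1 = pd.DataFrame(
    [["A", "B", "C", "D"], ["A", "C", "D", "D"]],
    columns=["col1", "col2", "col3", "col4"],
)
df2 = pd.DataFrame(
    [["A", "D", "D", "D"], ["A", "B", "D", "D"]],
    columns=["col1", "col2", "col3", "col4"],
)

sub = ["col1", "col2"]

# Left-join df1 against the distinct key pairs of df2. indicator=True adds a
# "_merge" column recording whether each df1 row found a partner in df2.
merged = df1.merge(df2[sub].drop_duplicates(), on=sub, how="left", indicator=True)

# Keep only the rows with no partner and drop the helper column.
result = merged.loc[merged["_merge"] == "left_only", df1.columns]
print(result)
```

Note that merge builds a fresh index, so if df1 has a meaningful index you would need to carry it along as a column before merging.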

1 Comment

Thanks for the response, +1. I posted a solution I came up with. Do you know if either of these is more efficient? I'm not familiar with what pandas is actually doing under the hood.
1

I know this is a very old question, but it comes up at the top of Google when I search for this problem. If there is a column in both dataframes whose values are unique, it can be done like this:

  uniq_value_list = df2['col1'].tolist()
  df3 = df1[~df1.col1.isin(uniq_value_list)]

Now the third dataframe, df3, will have the rows that are in df1 but not in df2.
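As a small self-contained illustration of this single-column case, here is a sketch using made-up data (the values below are hypothetical and not from the question). It also skips the intermediate list, since Series.isin accepts another Series directly.

```python
import pandas as pd

# Hypothetical frames in which col1 uniquely identifies each row.
df1 = pd.DataFrame({"col1": ["A", "B", "C"], "col2": [1, 2, 3]})
df2 = pd.DataFrame({"col1": ["B", "D"], "col2": [9, 9]})

# Rows of df1 whose col1 value does not appear anywhere in df2's col1.
df3 = df1[~df1["col1"].isin(df2["col1"])]
print(df3)
```

This only compares one column, so it does not cover the two-column match asked about in the question.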


0

I don't know if this is efficient, but I found a way to do it after hours of experimenting. It involves first re-indexing the dataframes to use the columns you care about as the index.

df1.set_index(['col1', 'col2'], inplace=True)
df2.set_index(['col1', 'col2'], inplace=True)

df1[df1.index.map(lambda x: x not in df2.index)]
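The same idea can be sketched without modifying df1 and df2 in place, which keeps col1 and col2 as ordinary columns in the result. This is an assumption-laden variant rather than the code above: it builds temporary MultiIndex objects with set_index (without inplace=True) and uses Index.isin in place of the lambda. The names keys, idx1, idx2 and result are illustrative.

```python
import pandas as pd

# Data from the question.
df1 = pd.DataFrame(
    [["A", "B", "C", "D"], ["A", "C", "D", "D"]],
    columns=["col1", "col2", "col3", "col4"],
)
df2 = pd.DataFrame(
    [["A", "D", "D", "D"], ["A", "B", "D", "D"]],
    columns=["col1", "col2", "col3", "col4"],
)

keys = ["col1", "col2"]

# Build throwaway MultiIndex objects from the key columns; df1 and df2
# themselves are left untouched because set_index returns a new frame here.
idx1 = df1.set_index(keys).index
idx2 = df2.set_index(keys).index

# isin on a MultiIndex compares whole (col1, col2) tuples.
result = df1[~idx1.isin(idx2)]
print(result)
```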

