
I have two data frames:

df1

A1    B1
1     a
2     s
3     d

and

df2

A1    B1
1     a
2     x
3     d

I want to compare df1 and df2 on column B1. The column A1 can be used to join. I want to know:

  1. Which rows are different in df1 and df2 with respect to column B1?
  2. Is there a mismatch in the values of column A1? For example, is df2 missing some values that are present in df1, or vice versa? If so, which ones?

I tried using merge and join but that is not what I am looking for.

  • 1. df1['B1'] == df2['B1'] 2. Can you explain and post the desired output? It's unclear to me what you mean. Commented Dec 8, 2015 at 16:33

1 Answer


I've edited the raw data to illustrate the case of A1 keys in one dataframe but not the other.

When doing your merge, you want to specify an 'outer' merge so that you can see these items with an A1 key in one dataframe but not the other.

I've included the suffixes '_1' and '_2' to indicate the dataframe source (_1 = df1 and _2 = df2) of column B1.

import pandas as pd

df1 = pd.DataFrame({'A1': [1, 2, 3, 4], 'B1': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'A1': [1, 2, 3, 5], 'B1': ['a', 'd', 'c', 'e']})

df3 = df1.merge(df2, how='outer', on='A1', suffixes=['_1', '_2'])
df3['check'] = df3.B1_1 == df3.B1_2

>>> df3
   A1 B1_1 B1_2  check
0   1    a    a   True
1   2    b    d  False
2   3    c    c   True
3   4    d  NaN  False
4   5  NaN    e  False

To check for missing A1 keys in df1 and df2:

# A1 value missing in `df1`
>>> df3[df3.B1_1.isnull()]
   A1 B1_1 B1_2  check
4   5  NaN    e  False

# A1 value missing in `df2`
>>> df3[df3.B1_2.isnull()]
   A1 B1_1 B1_2  check
3   4    d  NaN  False
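
Note that the `isnull()` test above relies on B1 never being missing in the source data. As a hedged alternative that looks only at the keys, `Series.isin` can be used without merging at all. This is a sketch on the same sample frames; the names `missing_from_df2` and `missing_from_df1` are illustrative and not part of the original answer:

```python
import pandas as pd

df1 = pd.DataFrame({'A1': [1, 2, 3, 4], 'B1': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'A1': [1, 2, 3, 5], 'B1': ['a', 'd', 'c', 'e']})

# Rows of df1 whose A1 key does not appear anywhere in df2
missing_from_df2 = df1[~df1['A1'].isin(df2['A1'])]

# Rows of df2 whose A1 key does not appear anywhere in df1
missing_from_df1 = df2[~df2['A1'].isin(df1['A1'])]

print(missing_from_df2)
print(missing_from_df1)
```

This only reports key mismatches; comparing B1 for the shared keys still needs the merge.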

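EDIT: Thanks to @EdChum (the source of all pandas knowledge...), you can also pass indicator=True to the merge, which adds a _merge column showing whether each row was found in the left frame only, the right frame only, or both.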

df3 = df1.merge(df2, how='outer', on='A1', suffixes=['_1', '_2'], indicator=True)
df3['check'] = df3.B1_1 == df3.B1_2

>>> df3
   A1 B1_1 B1_2      _merge  check
0   1    a    a        both   True
1   2    b    d        both  False
2   3    c    c        both   True
3   4    d  NaN   left_only  False
4   5  NaN    e  right_only  False
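
With the `_merge` column in place, both of the original questions can be answered by filtering on it. A minimal sketch, assuming the same sample frames as above; the variable names `mismatched`, `only_in_df1`, and `only_in_df2` are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'A1': [1, 2, 3, 4], 'B1': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'A1': [1, 2, 3, 5], 'B1': ['a', 'd', 'c', 'e']})

df3 = df1.merge(df2, how='outer', on='A1', suffixes=('_1', '_2'), indicator=True)

# Question 1: keys present in both frames whose B1 values differ
mismatched = df3[(df3['_merge'] == 'both') & (df3['B1_1'] != df3['B1_2'])]

# Question 2: keys present in only one of the frames
only_in_df1 = df3[df3['_merge'] == 'left_only']
only_in_df2 = df3[df3['_merge'] == 'right_only']

print(mismatched['A1'].tolist())    # [2]
print(only_in_df1['A1'].tolist())   # [4]
print(only_in_df2['A1'].tolist())   # [5]
```

Restricting the B1 comparison to rows marked `both` keeps genuine value differences separate from rows that fail the check only because one side is missing.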

6 Comments

If you are using a recent version of pandas, you can use pd.merge and pass indicator=True to add a column showing whether each row is in the left frame only, the right frame only, or both.
Perfect! Seems like just what I needed. Slight issue: when I try it on my data I don't see the missing rows. A few records are present in only one of the dataframes, and they don't show up after the merge. However, your solution works perfectly with the test data you have provided here. Thanks.
Any idea why that might be happening? I am reading the data in from some CSV files.
You would need to post the data where you are having issues, probably as a new question.
@EdChum is there any way to compare multiple columns?