I have two dataframes (df1 and df2) with the same column headings.

Both dataframes contain the same number of rows.

Each dataframe has around 20 columns.

The dataframes differ in some columns (any one or more of the 20).

One column (ssno) holds a unique value per row in both dataframes.

I need to output the ssno of every row that differs in any of the 20 fields.

Please help.

1 Answer

First compare the two DataFrames element-wise, then use any to flag the rows that contain at least one True, and finally use boolean indexing to select the ssno column for those rows:

import pandas as pd

df1 = pd.DataFrame({'ssno':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[70,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,90,2,4],
                   'F':list('aaXbbb')})

print (df1)
   B   C  D   E  F ssno
0  4  70  1   5  a    a
1  5   8  3   3  a    b
2  4   9  5   6  X    c
3  5   4  7  90  b    d
4  5   2  1   2  b    e
5  4   3  0   4  b    f

df2 = pd.DataFrame({'ssno':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})

print (df2)
   B  C  D  E  F ssno
0  4  7  1  5  a    a
1  5  8  3  3  a    b
2  4  9  5  6  a    c
3  5  4  7  9  b    d
4  5  2  1  2  b    e
5  4  3  0  4  b    f

s = df1.loc[(df1 != df2).any(axis=1), 'ssno']
print (s)
0    a
2    c
3    d
Name: ssno, dtype: object
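
The comparison above relies on both DataFrames having identical row and column labels; otherwise `df1 != df2` raises an error. If the rows might be ordered differently, one option is to align on ssno first. The following is a minimal sketch under that assumption, using a reduced, reshuffled copy of the sample data (the shuffled `df2` and the names `a`, `b`, `changed`, and `result` are illustrative, not part of the original answer):

```python
import pandas as pd

df1 = pd.DataFrame({'ssno': list('abcdef'),
                    'C': [70, 8, 9, 4, 2, 3],
                    'E': [5, 3, 6, 90, 2, 4],
                    'F': list('aaXbbb')})

# Assumption for illustration: same records as the answer's df2
# (columns C, E, F only), but with the rows in reverse order.
df2 = pd.DataFrame({'ssno': list('fedcba'),
                    'C': [3, 2, 4, 9, 8, 7],
                    'E': [4, 2, 9, 6, 3, 5],
                    'F': list('bbbaaa')})

# Use ssno as the index and sort so both frames carry identical labels.
a = df1.set_index('ssno').sort_index()
b = df2.set_index('ssno').sort_index()
b = b[a.columns]  # enforce the same column order

# True for every ssno whose row differs in at least one column.
changed = (a != b).any(axis=1)
result = changed[changed].index.tolist()
print(result)
```

This assumes every ssno appears exactly once in each frame, as the question states.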

Detail:

print (df1 != df2)
       B      C      D      E      F   ssno
0  False   True  False  False  False  False
1  False  False  False  False  False  False
2  False  False  False  False   True  False
3  False  False  False   True  False  False
4  False  False  False  False  False  False
5  False  False  False  False  False  False

print ((df1 != df2).any(axis=1))
0     True
1    False
2     True
3     True
4    False
5    False
dtype: bool
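
One caveat with `!=`: a missing value never compares equal to another missing value, so a cell that is NaN in both frames is reported as a difference. If that is not wanted, the mask can be adjusted. This is a rough sketch on made-up data (the frames `left` and `right` and the other names are hypothetical, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 'b' is NaN in both frames, row 'c' really differs.
left = pd.DataFrame({'ssno': ['a', 'b', 'c'], 'B': [1.0, np.nan, 3.0]})
right = pd.DataFrame({'ssno': ['a', 'b', 'c'], 'B': [1.0, np.nan, 4.0]})

# Plain comparison: NaN != NaN is True, so row 'b' is flagged too.
naive = (left != right).any(axis=1)

# Ignore cells that are missing in both frames.
both_missing = left.isna() & right.isna()
strict = ((left != right) & ~both_missing).any(axis=1)

naive_ssno = left.loc[naive, 'ssno'].tolist()
strict_ssno = left.loc[strict, 'ssno'].tolist()
print(naive_ssno, strict_ssno)
```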
5 Comments

Thank you @Jezrael. It helped me. Thanks a lot
Is it also possible to get the column name for index (0, 2, 3) which differs? Like for index 0, it differs at [a, d, e] etc
I think yes, do you need s2 = df1.columns[(df1 != df2).any()] ?
Or maybe s2 = df1[(df1 != df2)].stack().rename_axis(('idx','cols')).reset_index(name='val_df1') ?
Thanks Jezrael, your 2nd solution worked. I was looking for this only. Great. Thanks and regards.
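
For the follow-up in the comments (which columns differ for each flagged row), here is a small sketch that collects the differing column names per ssno. It is an alternative to the stack-based one-liner above, not the commenter's exact code, and the names `mask` and `diff_cols` are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'ssno': list('abcdef'),
                    'B': [4, 5, 4, 5, 5, 4],
                    'C': [70, 8, 9, 4, 2, 3],
                    'D': [1, 3, 5, 7, 1, 0],
                    'E': [5, 3, 6, 90, 2, 4],
                    'F': list('aaXbbb')})
df2 = pd.DataFrame({'ssno': list('abcdef'),
                    'B': [4, 5, 4, 5, 5, 4],
                    'C': [7, 8, 9, 4, 2, 3],
                    'D': [1, 3, 5, 7, 1, 0],
                    'E': [5, 3, 6, 9, 2, 4],
                    'F': list('aaabbb')})

# Boolean mask of differing cells, labelled by ssno instead of position.
mask = df1.set_index('ssno') != df2.set_index('ssno')

# Map each ssno with at least one difference to its differing columns.
diff_cols = {
    ssno: [col for col in mask.columns if mask.at[ssno, col]]
    for ssno in mask.index[mask.any(axis=1)]
}
print(diff_cols)
```

A plain dict is used here for readability; for large frames the vectorised stack approach from the comment is likely to be faster.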
