Difference between two dataframes and output a specific column

Question

I have two dataframes (df1 and df2) with the same column headings.

Both dataframes contain the same number of rows.

There are around 20 columns in each dataframe.

The dataframes differ in some columns (any one or more of the 20).

There is a particular column (ssno) whose values are unique in both dataframes.

I need to output the ssno values for those rows which differ in any of the 20 fields.

Please help.

jezrael · Accepted Answer · 2018-02-10 08:42:19Z

3

First compare both DataFrames and use any to get at least one True per row, then use boolean indexing to filter the ssno column:

import pandas as pd

df1 = pd.DataFrame({'ssno':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[70,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,90,2,4],
                   'F':list('aaXbbb')})

print (df1)
   B   C  D   E  F ssno
0  4  70  1   5  a    a
1  5   8  3   3  a    b
2  4   9  5   6  X    c
3  5   4  7  90  b    d
4  5   2  1   2  b    e
5  4   3  0   4  b    f

df2 = pd.DataFrame({'ssno':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})

print (df2)
   B  C  D  E  F ssno
0  4  7  1  5  a    a
1  5  8  3  3  a    b
2  4  9  5  6  a    c
3  5  4  7  9  b    d
4  5  2  1  2  b    e
5  4  3  0  4  b    f

s = df1.loc[(df1 != df2).any(axis=1), 'ssno']
print (s)
0    a
2    c
3    d
Name: ssno, dtype: object

Detail:

print (df1 != df2)
       B      C      D      E      F   ssno
0  False   True  False  False  False  False
1  False  False  False  False  False  False
2  False  False  False  False   True  False
3  False  False  False   True  False  False
4  False  False  False  False  False  False
5  False  False  False  False  False  False

print ((df1 != df2).any(axis=1))
0     True
1    False
2     True
3     True
4    False
5    False
dtype: bool
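
Note that df1 != df2 only works when both frames have identical row and column labels. If the rows of the two frames might not be in the same order, one possible variation is to match rows by ssno before comparing. The following is a minimal sketch under that assumption; the helper name differing_keys and the reduced set of columns are illustrative only and not part of the original answer. It also assumes every ssno appears in both frames, and that there are no NaN values (NaN != NaN is True, so such rows would be reported as different).

```python
import pandas as pd

def differing_keys(left, right, key="ssno"):
    """Return the key values of rows that differ in at least one column."""
    # Index both frames by the key so rows are matched by key, not by position.
    a = left.set_index(key).sort_index()
    # Give the second frame the same row and column order as the first.
    b = right.set_index(key).reindex(index=a.index, columns=a.columns)
    # True for every row with at least one unequal cell.
    changed = (a != b).any(axis=1)
    return a.index[changed].tolist()

df1 = pd.DataFrame({'ssno': list('abcdef'),
                    'C': [70, 8, 9, 4, 2, 3],
                    'E': [5, 3, 6, 90, 2, 4],
                    'F': list('aaXbbb')})
df2 = pd.DataFrame({'ssno': list('abcdef'),
                    'C': [7, 8, 9, 4, 2, 3],
                    'E': [5, 3, 6, 9, 2, 4],
                    'F': list('aaabbb')})

# Reverse the second frame's rows to show that row order no longer matters.
shuffled = df2.iloc[::-1].reset_index(drop=True)
result = differing_keys(df1, shuffled)
print(result)
```

With the sample data this should report the same three ssno values as the positional comparison, even though the second frame is in reverse order.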

edited Feb 10, 2018 at 8:42

answered Feb 10, 2018 at 8:35


5 Comments

Arijit Ghosh Over a year ago

Thank you @Jezrael. It helped me. Thanks a lot

Arijit Ghosh Over a year ago

Is it also possible to get the names of the columns that differ for each index (0, 2, 3)? For example, index 0 differs at [a, d, e], etc.

jezrael Over a year ago

I think yes, do you need s2 = df1.columns[(df1 != df2).any()] ?

jezrael Over a year ago

Or maybe s2 = df1[(df1 != df2)].stack().rename_axis(('idx','cols')).reset_index(name='val_df1') ?

Arijit Ghosh Over a year ago

Thanks Jezrael, your 2nd solution worked. I was looking for this only. Great. Thanks and regards.
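
For readers who want the differing column names grouped per row, here is a small sketch of one alternative to the stack approach from the comments. It builds a plain dictionary keyed by ssno; the variable name changed_cols and the reduced set of columns are assumptions made for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'ssno': list('abcdef'),
                    'C': [70, 8, 9, 4, 2, 3],
                    'E': [5, 3, 6, 90, 2, 4],
                    'F': list('aaXbbb')})
df2 = pd.DataFrame({'ssno': list('abcdef'),
                    'C': [7, 8, 9, 4, 2, 3],
                    'E': [5, 3, 6, 9, 2, 4],
                    'F': list('aaabbb')})

# Element-wise comparison: True wherever the two frames disagree.
mask = df1 != df2

# Map each ssno with at least one difference to the columns that differ.
changed_cols = {
    ssno: [col for col in mask.columns if row[col]]
    for ssno, (_, row) in zip(df1['ssno'], mask.iterrows())
    if row.any()
}
print(changed_cols)
```

Iterating row by row is slower than the vectorized stack version on large frames, but the resulting dictionary is easy to inspect for a handful of differing rows.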
