Python Pandas: Compare values of two dataframes with different columns names by ID or row

Question

I have two dataframes (let's name them M and K) that come from different sources. They have different columns names and the only one column that is the same in both dataframes is ID column (M[id] == K[id]).

A number of rows in both dataframes are equal; a number of columns are different.

The goal is to create a matrix which will how many columns have the same values for the same ID (or row). The size of the matrix (MK) is M.columns X K.columns. Each cell is store count of matched values for the pair of M.column and K.column. Tha maximum number in the cell is the count of rows for M or K, as they are the same. Missing values (NaN) should be ignored.

Let talk in figures =)

data_M = {'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
        'm1': ['a', 'b', 'c', 'd', 'e', 2],
        'm2': [1, 2, 3, 4, np.nan, 1],
        'm3': ['aa','b','cc','d','ff', 3],
        'm4': [4, 6, 3, 4, np.nan, 2],
        'm5': ['b', 6, 'a', 4, np.nan, 1],
        }
data_K = {'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
        'k1': ['z', 'bb', 'c', 'd', 'e', 4],
        'k2': [1, 2, 32, 5, np.nan, 1],
        'k3': ['aa','b','cc','d','ff', 1],
        'k4': [4, 2, 2, 4, np.nan, 4],
        'k5': [4, 1, 'as', 4, np.nan, 2],
        'k6': ['aa', 1, 'a', 3, np.nan, 2],
        }
M = pd.DataFrame(data_M, columns = ['id','m1','m2','m3','m4','m5']) 
K = pd.DataFrame(data_K, columns = ['id','k1','k2','k3','k4', 'k5','k6'])

M and K output

M
Out[2]: 
    id m1   m2  m3   m4   m5
0  id1  a  1.0  aa  4.0    b
1  id2  b  2.0   b  6.0    6
2  id3  c  3.0  cc  3.0    a
3  id4  d  4.0   d  4.0    4
4  id5  e  NaN  ff  NaN  NaN
5  id6  2  1.0   3  2.0    1

K
Out[3]: 
    id  k1    k2  k3   k4   k5   k6
0  id1   z   1.0  aa  4.0    4   aa
1  id2  bb   2.0   b  2.0    1    1
2  id3   c  32.0  cc  2.0   as    a
3  id4   d   5.0   d  4.0    4    3
4  id5   e   NaN  ff  NaN  NaN  NaN
5  id6   4   1.0   1  4.0    2    2

Afte the first compare for id=='id1' the MK matrix should look something like this:

    id  m1  m2  m3  m4  m5
id  1   0   0   0   0   0
k1  0   0   0   0   0   0
k2  0   0   1   0   0   0
k3  0   0   0   1   0   0
k4  0   0   0   0   1   0
k5  0   0   0   0   1   0
k6  0   0   0   1   0   0

On the second one (id=='id2') it should be next:

    id  m1  m2  m3  m4  m5
id  2   0   0   0   0   0
k1  0   0   0   0   0   0
k2  0   0   2   0   0   0
k3  0   0   0   2   0   0
k4  0   0   1   0   1   0
k5  0   0   0   0   1   0
k6  0   0   0   1   0   0

At the very end, each cell will be transformed to the percentage of matched values.

And the last one. Theoretically, it could be more that one row for each ID. However, it is not the case for the current issue. But if you have inspiration, you are welcome to solve the 'general case' ^_^

Many thanks.

piRSquared · Accepted Answer · 2017-03-28 18:55:03Z

4

Approach using numpy broadcasting and pd.Panel

m = M.values[:, 1:]
k = K.values[:, 1:]

p = pd.Panel(
    (m[:, None] == k[:, :, None]).astype(np.uint8),
    M.id.values, K.columns[1:], M.columns[1:])

then access for each id

p['id1']

    m1  m2  m3  m4  m5
k1   0   0   0   0   0
k2   0   1   0   0   0
k3   0   0   1   0   0
k4   0   0   0   1   0
k5   0   0   0   1   0
k6   0   0   1   0   0

Or using pandas groupby

df = M.set_index('id').join(K.set_index('id'))

def row_comp(r):
    m = r.filter(like='m')
    k = r.filter(like='k')
    return pd.DataFrame(
        (m.values == k.values.T).astype(np.uint8),
        k.columns, m.columns
    )


df.groupby(level=0).apply(row_comp)

        m1  m2  m3  m4  m5
id                        
id1 k1   0   0   0   0   0
    k2   0   1   0   0   0
    k3   0   0   1   0   0
    k4   0   0   0   1   0
    k5   0   0   0   1   0
    k6   0   0   1   0   0
id2 k1   0   0   0   0   0
    k2   0   1   0   0   0
    k3   1   0   1   0   0
    k4   0   1   0   0   0
    k5   0   0   0   0   0
    k6   0   0   0   0   0
id3 k1   1   0   0   0   0
    k2   0   0   0   0   0
    k3   0   0   1   0   0
    k4   0   0   0   0   0
    k5   0   0   0   0   0
    k6   0   0   0   0   1
id4 k1   1   0   1   0   0
    k2   0   0   0   0   0
    k3   1   0   1   0   0
    k4   0   1   0   1   1
    k5   0   1   0   1   1
    k6   0   0   0   0   0
id5 k1   1   0   0   0   0
    k2   0   0   0   0   0
    k3   0   0   1   0   0
    k4   0   0   0   0   0
    k5   0   0   0   0   1
    k6   0   0   0   0   1
id6 k1   0   0   0   0   0
    k2   0   1   0   0   1
    k3   0   1   0   0   1
    k4   0   0   0   0   0
    k5   1   0   0   1   0
    k6   1   0   0   1   0

edited Mar 28, 2017 at 18:55

answered Mar 28, 2017 at 18:47

piRSquared

296k68 gold badges509 silver badges654 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

MaxU - stand with Ukraine Over a year ago

Very clever solution !

Zeugma Over a year ago

@piRSquared Game over

piRSquared Over a year ago

@Boud I'm not sure what that means... but I lol'd anyway. And then my ring tone played softly in my head... Zelda Theme Song

Collectives™ on Stack Overflow

Python Pandas: Compare values of two dataframes with different columns names by ID or row

1 Answer 1

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related