I have two dataframes (let's call them M and K) that come from different sources. Their column names differ; the only column they share is the ID column (M['id'] == K['id']).
Both dataframes have the same number of rows, but the number of columns differs.
The goal is to create a matrix that shows how often each pair of columns holds the same value for the same ID (i.e. row). The size of the matrix (MK) is M.columns x K.columns. Each cell stores the count of matched values for one pair of an M column and a K column. The maximum number in a cell is the row count of M or K, since they are the same. Missing values (NaN) should be ignored.
Let's talk in figures =)
import numpy as np
import pandas as pd

data_M = {'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
          'm1': ['a', 'b', 'c', 'd', 'e', 2],
          'm2': [1, 2, 3, 4, np.nan, 1],
          'm3': ['aa', 'b', 'cc', 'd', 'ff', 3],
          'm4': [4, 6, 3, 4, np.nan, 2],
          'm5': ['b', 6, 'a', 4, np.nan, 1],
          }
data_K = {'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
          'k1': ['z', 'bb', 'c', 'd', 'e', 4],
          'k2': [1, 2, 32, 5, np.nan, 1],
          'k3': ['aa', 'b', 'cc', 'd', 'ff', 1],
          'k4': [4, 2, 2, 4, np.nan, 4],
          'k5': [4, 1, 'as', 4, np.nan, 2],
          'k6': ['aa', 1, 'a', 3, np.nan, 2],
          }
M = pd.DataFrame(data_M, columns=['id', 'm1', 'm2', 'm3', 'm4', 'm5'])
K = pd.DataFrame(data_K, columns=['id', 'k1', 'k2', 'k3', 'k4', 'k5', 'k6'])
Output of M and K:
M
Out[2]:
    id m1   m2  m3   m4   m5
0  id1  a  1.0  aa  4.0    b
1  id2  b  2.0   b  6.0    6
2  id3  c  3.0  cc  3.0    a
3  id4  d  4.0   d  4.0    4
4  id5  e  NaN  ff  NaN  NaN
5  id6  2  1.0   3  2.0    1
K
Out[3]:
    id  k1    k2  k3   k4   k5   k6
0  id1   z   1.0  aa  4.0    4   aa
1  id2  bb   2.0   b  2.0    1    1
2  id3   c  32.0  cc  2.0   as    a
3  id4   d   5.0   d  4.0    4    3
4  id5   e   NaN  ff  NaN  NaN  NaN
5  id6   4   1.0   1  4.0    2    2
After the first comparison (id=='id1') the MK matrix should look something like this:
    id  m1  m2  m3  m4  m5
id   1   0   0   0   0   0
k1   0   0   0   0   0   0
k2   0   0   1   0   0   0
k3   0   0   0   1   0   0
k4   0   0   0   0   1   0
k5   0   0   0   0   1   0
k6   0   0   0   1   0   0
After the second one (id=='id2') it should look like this:
    id  m1  m2  m3  m4  m5
id   2   0   0   0   0   0
k1   0   0   0   0   0   0
k2   0   0   2   0   0   0
k3   0   1   0   2   0   0
k4   0   0   1   0   1   0
k5   0   0   0   0   1   0
k6   0   0   0   1   0   0
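For reference, the accumulation described above could be sketched roughly as follows. This is only one possible approach, not a definitive solution: the function name match_counts and its key parameter are made up for illustration, and it assumes exactly one row per ID in each dataframe. Both frames are cast to object dtype so that mixed string/number columns are compared element by element with plain Python equality.

```python
import numpy as np
import pandas as pd

data_M = {'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
          'm1': ['a', 'b', 'c', 'd', 'e', 2],
          'm2': [1, 2, 3, 4, np.nan, 1],
          'm3': ['aa', 'b', 'cc', 'd', 'ff', 3],
          'm4': [4, 6, 3, 4, np.nan, 2],
          'm5': ['b', 6, 'a', 4, np.nan, 1]}
data_K = {'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
          'k1': ['z', 'bb', 'c', 'd', 'e', 4],
          'k2': [1, 2, 32, 5, np.nan, 1],
          'k3': ['aa', 'b', 'cc', 'd', 'ff', 1],
          'k4': [4, 2, 2, 4, np.nan, 4],
          'k5': [4, 1, 'as', 4, np.nan, 2],
          'k6': ['aa', 1, 'a', 3, np.nan, 2]}
M = pd.DataFrame(data_M)
K = pd.DataFrame(data_K)


def match_counts(M, K, key="id"):
    """Hypothetical helper: count matching values per (K column, M column) pair.

    Assumes one row per key value in each frame.
    """
    # Align the rows of K to the row order of M via the shared key.
    m = M.astype(object).set_index(key, drop=False)
    k = K.astype(object).set_index(key, drop=False).reindex(m.index)
    counts = pd.DataFrame(0, index=k.columns, columns=m.columns)
    for kc in k.columns:
        for mc in m.columns:
            # Ignore rows where either side is missing.
            both_present = m[mc].notna() & k[kc].notna()
            counts.loc[kc, mc] = int((both_present & (m[mc] == k[kc])).sum())
    return counts


MK = match_counts(M, K)
print(MK)
```

The double loop is quadratic in the number of columns, which is probably fine for a handful of columns; for very wide frames a vectorised comparison on the underlying arrays might be worth exploring.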
At the very end, each cell will be transformed to the percentage of matched values.
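As a small illustration of that last step, here is a sketch that turns counts into percentages. It assumes the denominator is the total number of rows processed; dividing by the number of rows where both values are present would be another reasonable reading. The counts are the k2..k4 / m2..m4 cells of the matrix after the first two IDs, and the names counts, n_rows and percent are only illustrative.

```python
import pandas as pd

# Part of the MK matrix after processing id1 and id2.
counts = pd.DataFrame(
    [[2, 0, 0],
     [0, 2, 0],
     [1, 0, 1]],
    index=["k2", "k3", "k4"],
    columns=["m2", "m3", "m4"],
)

n_rows = 2  # assumption: number of rows (IDs) compared so far
percent = counts / n_rows * 100
print(percent)
```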
One last thing. In theory, there could be more than one row per ID. That is not the case for the current problem, but if you feel inspired, you are welcome to solve the 'general case' ^_^
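For the general case, one possible reading is that every M row should be compared with every K row sharing its ID. Under that assumption (which may not be what is wanted), an inner merge on the ID builds all such row pairs, and the counting works as before. The helper name match_counts_general and the tiny frames M2 and K2 are invented for illustration; the sketch also relies on M and K having no column names in common apart from the ID.

```python
import numpy as np
import pandas as pd


def match_counts_general(M, K, key="id"):
    """Hypothetical helper: count matches over all M/K row pairs sharing a key."""
    # The inner merge yields one row per (M row, K row) combination with equal key.
    pairs = M.astype(object).merge(K.astype(object), on=key, how="inner")
    m_cols = [c for c in M.columns if c != key]
    k_cols = [c for c in K.columns if c != key]
    counts = pd.DataFrame(0, index=k_cols, columns=m_cols)
    for kc in k_cols:
        for mc in m_cols:
            a, b = pairs[mc], pairs[kc]
            # Missing values on either side never count as a match.
            counts.loc[kc, mc] = int((a.notna() & b.notna() & (a == b)).sum())
    return counts


# Toy data: 'id1' appears twice in M2.
M2 = pd.DataFrame({"id": ["id1", "id1", "id2"], "m1": ["a", "b", np.nan]})
K2 = pd.DataFrame({"id": ["id1", "id2"], "k1": ["a", np.nan], "k2": ["b", "x"]})

general = match_counts_general(M2, K2)
print(general)
```

Note that with duplicated IDs a cell can exceed the row count of M or K, so the denominator for the percentage would need to be the number of row pairs instead.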
Many thanks.