I have two dataframes (let's call them M and K) that come from different sources. They have different column names; the only column common to both is the ID column (M[id] == K[id]).

The number of rows is the same in both dataframes; the number of columns differs.

The goal is to create a matrix that shows how many values match between each pair of columns for the same ID (or row). The size of the matrix (MK) is M.columns x K.columns. Each cell stores the count of matched values for one pair of M.column and K.column. The maximum number in a cell is the row count of M or K, as they are the same. Missing values (NaN) should be ignored.

Let's talk in figures =)

import numpy as np
import pandas as pd

data_M = {'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
        'm1': ['a', 'b', 'c', 'd', 'e', 2],
        'm2': [1, 2, 3, 4, np.nan, 1],
        'm3': ['aa','b','cc','d','ff', 3],
        'm4': [4, 6, 3, 4, np.nan, 2],
        'm5': ['b', 6, 'a', 4, np.nan, 1],
        }
data_K = {'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
        'k1': ['z', 'bb', 'c', 'd', 'e', 4],
        'k2': [1, 2, 32, 5, np.nan, 1],
        'k3': ['aa','b','cc','d','ff', 1],
        'k4': [4, 2, 2, 4, np.nan, 4],
        'k5': [4, 1, 'as', 4, np.nan, 2],
        'k6': ['aa', 1, 'a', 3, np.nan, 2],
        }
M = pd.DataFrame(data_M, columns = ['id','m1','m2','m3','m4','m5']) 
K = pd.DataFrame(data_K, columns = ['id','k1','k2','k3','k4', 'k5','k6'])

M and K output

M
Out[2]: 
    id m1   m2  m3   m4   m5
0  id1  a  1.0  aa  4.0    b
1  id2  b  2.0   b  6.0    6
2  id3  c  3.0  cc  3.0    a
3  id4  d  4.0   d  4.0    4
4  id5  e  NaN  ff  NaN  NaN
5  id6  2  1.0   3  2.0    1

K
Out[3]: 
    id  k1    k2  k3   k4   k5   k6
0  id1   z   1.0  aa  4.0    4   aa
1  id2  bb   2.0   b  2.0    1    1
2  id3   c  32.0  cc  2.0   as    a
3  id4   d   5.0   d  4.0    4    3
4  id5   e   NaN  ff  NaN  NaN  NaN
5  id6   4   1.0   1  4.0    2    2

After the first comparison (id=='id1') the MK matrix should look something like this:

    id  m1  m2  m3  m4  m5
id  1   0   0   0   0   0
k1  0   0   0   0   0   0
k2  0   0   1   0   0   0
k3  0   0   0   1   0   0
k4  0   0   0   0   1   0
k5  0   0   0   0   1   0
k6  0   0   0   1   0   0

After the second one (id=='id2') it should be:

    id  m1  m2  m3  m4  m5
id  2   0   0   0   0   0
k1  0   0   0   0   0   0
k2  0   0   2   0   0   0
k3  0   1   0   2   0   0
k4  0   0   1   0   1   0
k5  0   0   0   0   1   0
k6  0   0   0   1   0   0

At the very end, each cell will be transformed to the percentage of matched values.

One last thing. Theoretically, there could be more than one row for each ID. However, that is not the case for the current issue. But if you feel inspired, you are welcome to solve the 'general case' ^_^

Many thanks.

1 Answer

Approach using numpy broadcasting and pd.Panel (pd.Panel was deprecated in pandas 0.20 and removed in 0.25, so this needs an older pandas)

m = M.values[:, 1:]
k = K.values[:, 1:]

p = pd.Panel(
    (m[:, None] == k[:, :, None]).astype(np.uint8),
    M.id.values, K.columns[1:], M.columns[1:])

then access for each id

p['id1']

    m1  m2  m3  m4  m5
k1   0   0   0   0   0
k2   0   1   0   0   0
k3   0   0   1   0   0
k4   0   0   0   1   0
k5   0   0   0   1   0
k6   0   0   1   0   0
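
On a pandas version without pd.Panel, the same broadcasting can be kept as a plain 3-D numpy array. The sketch below is not part of the original answer. It assumes one row per id, aligns K to M by id, masks NaN explicitly (as the question asks), and sums over the id axis to get the MK count matrix. The names m_df, k_df, both_present, eq, MK and MK_pct are made up for illustration, and dividing by len(M) is only one possible reading of "percentage of matched values".

```python
import numpy as np
import pandas as pd

# Same sample data as in the question.
data_M = {'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
          'm1': ['a', 'b', 'c', 'd', 'e', 2],
          'm2': [1, 2, 3, 4, np.nan, 1],
          'm3': ['aa', 'b', 'cc', 'd', 'ff', 3],
          'm4': [4, 6, 3, 4, np.nan, 2],
          'm5': ['b', 6, 'a', 4, np.nan, 1]}
data_K = {'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
          'k1': ['z', 'bb', 'c', 'd', 'e', 4],
          'k2': [1, 2, 32, 5, np.nan, 1],
          'k3': ['aa', 'b', 'cc', 'd', 'ff', 1],
          'k4': [4, 2, 2, 4, np.nan, 4],
          'k5': [4, 1, 'as', 4, np.nan, 2],
          'k6': ['aa', 1, 'a', 3, np.nan, 2]}
M = pd.DataFrame(data_M)
K = pd.DataFrame(data_K)

# Assumption: ids are unique in both frames; K is put in M's id order.
m_df = M.set_index('id')
k_df = K.set_index('id').reindex(m_df.index)

m = m_df.to_numpy(dtype=object)   # shape (n_rows, 5)
k = k_df.to_numpy(dtype=object)   # shape (n_rows, 6)

# both_present[i, a, b]: neither K column a nor M column b is NaN in row i
both_present = (k_df.notna().to_numpy()[:, :, None]
                & m_df.notna().to_numpy()[:, None, :])

# eq[i, a, b]: row i of K column a equals row i of M column b (NaN ignored)
eq = (k[:, :, None] == m[:, None, :]).astype(bool) & both_present

# Sum over the id axis: one count per (K column, M column) pair.
MK = pd.DataFrame(eq.sum(axis=0), index=k_df.columns, columns=m_df.columns)

# Assumption: percentage relative to the total row count.
MK_pct = 100 * MK / len(M)
```

Here eq[0] corresponds to the per-id table for 'id1' shown above, and MK.loc['k3', 'm3'] is 5 for the sample data, since k3 and m3 agree on every id except 'id6'.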

Or using pandas groupby

df = M.set_index('id').join(K.set_index('id'))

def row_comp(r):
    m = r.filter(like='m')
    k = r.filter(like='k')
    return pd.DataFrame(
        (m.values == k.values.T).astype(np.uint8),
        k.columns, m.columns
    )


df.groupby(level=0).apply(row_comp)

        m1  m2  m3  m4  m5
id                        
id1 k1   0   0   0   0   0
    k2   0   1   0   0   0
    k3   0   0   1   0   0
    k4   0   0   0   1   0
    k5   0   0   0   1   0
    k6   0   0   1   0   0
id2 k1   0   0   0   0   0
    k2   0   1   0   0   0
    k3   1   0   1   0   0
    k4   0   1   0   0   0
    k5   0   0   0   0   0
    k6   0   0   0   0   0
id3 k1   1   0   0   0   0
    k2   0   0   0   0   0
    k3   0   0   1   0   0
    k4   0   0   0   0   0
    k5   0   0   0   0   0
    k6   0   0   0   0   1
id4 k1   1   0   1   0   0
    k2   0   0   0   0   0
    k3   1   0   1   0   0
    k4   0   1   0   1   1
    k5   0   1   0   1   1
    k6   0   0   0   0   0
id5 k1   1   0   0   0   0
    k2   0   0   0   0   0
    k3   0   0   1   0   0
    k4   0   0   0   0   0
    k5   0   0   0   0   1
    k6   0   0   0   0   1
id6 k1   0   0   0   0   0
    k2   0   1   0   0   1
    k3   0   1   0   0   1
    k4   0   0   0   0   0
    k5   1   0   0   1   0
    k6   1   0   0   1   0
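
The output above is still one block per id, and its id5 block reports matches for k5/m5 and k6/m5 even though both sides are NaN there. A possible way to finish the job, sketched here as an assumption rather than as part of the original answer, is to mask missing values inside row_comp and then sum the blocks over the second index level. The names present, per_id and MK are illustrative, and the sketch still assumes exactly one row per id.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question.
M = pd.DataFrame({'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
                  'm1': ['a', 'b', 'c', 'd', 'e', 2],
                  'm2': [1, 2, 3, 4, np.nan, 1],
                  'm3': ['aa', 'b', 'cc', 'd', 'ff', 3],
                  'm4': [4, 6, 3, 4, np.nan, 2],
                  'm5': ['b', 6, 'a', 4, np.nan, 1]})
K = pd.DataFrame({'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
                  'k1': ['z', 'bb', 'c', 'd', 'e', 4],
                  'k2': [1, 2, 32, 5, np.nan, 1],
                  'k3': ['aa', 'b', 'cc', 'd', 'ff', 1],
                  'k4': [4, 2, 2, 4, np.nan, 4],
                  'k5': [4, 1, 'as', 4, np.nan, 2],
                  'k6': ['aa', 1, 'a', 3, np.nan, 2]})

df = M.set_index('id').join(K.set_index('id'))

def row_comp(r):
    # r holds the single joined row for one id
    m = r.filter(like='m')
    k = r.filter(like='k')
    equal = (m.values == k.values.T).astype(bool)      # shape (6, 5)
    # Count a cell only when both values are present (NaN ignored).
    present = m.notna().values & k.notna().values.T
    return pd.DataFrame((equal & present).astype(np.uint8),
                        k.columns, m.columns)

per_id = df.groupby(level=0).apply(row_comp)

# Level 1 of the index holds the K column names; summing over it
# collapses the per-id blocks into the MK count matrix.
MK = per_id.groupby(level=1).sum()
```

With the sample data, MK.loc['k5', 'm5'] is 1 (only 'id4' matches), because the NaN pair at 'id5' is no longer counted. Dividing MK by len(M) would then give the share of matching rows, if that is the intended percentage.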
3 Comments

Very clever solution !
@piRSquared Game over
@Boud I'm not sure what that means... but I lol'd anyway. And then my ring tone played softly in my head... Zelda Theme Song
