Given the data frames illustrated in the image below, I would like to merge on ['A','B','C'] and ['X','Y','Z'] first, then gradually look for a match with one less column, i.e. ['A','B'] and ['X','Y'], then ['A'] and ['X'], without duplicating rows in the result. In the example below, a,y,y,v3 is left out since a,d,d already matched.

[image: the data frames df1 and df2]

My code so far, matches on all 3 columns:

import pandas as pd

df1 = pd.DataFrame({"A":['a','b','c'],"B":['d','e','f'],"C":['d','e','f']})
df2 = pd.DataFrame({"X":['a','b','a','c'],"Y":['d','e','y','z'],"Z":['d','x','y','z'],"V":['v1','v2','v3','v4']})

merged = pd.merge(df1,df2,left_on=['A','B','C'],right_on=['X','Y','Z'], how='left')
merged = merged.drop_duplicates(['A','B','C'])
merged.head()

[image: output of the merge above]

How can I achieve my goal?

Update: expected output: [image]

  • So a,y,y,v3 is left out because you already have a row that matches on all 3 columns? Could you also add the expected output? Commented Dec 1, 2020 at 12:45
  • That is correct, I'll add the expected output. Commented Dec 1, 2020 at 12:51

2 Answers

One idea is to merge several times in a loop, calling DataFrame.drop_duplicates on the second DataFrame before each merge, which avoids duplicated rows in the final DataFrame:

from functools import reduce

dfs = []
L = [['A', 'B', 'C'], ['X', 'Y', 'Z']]

for i in range(len(L[0]), 0, -1):
    df22 = df2.drop_duplicates(L[1][:i])
    df = pd.merge(df1,df22,left_on=L[0][:i],right_on=L[1][:i], how='left')
    dfs.append(df)

df = reduce(lambda l, r: l.fillna(r), dfs)
print(df)
   A  B  C  X  Y  Z   V
0  a  d  d  a  d  d  v1
1  b  e  e  b  e  x  v2
2  c  f  f  c  z  z  v4

This works the same way as:

merged1 = pd.merge(df1,df2.drop_duplicates(['X','Y','Z']),left_on=['A','B','C'],right_on=['X','Y','Z'], how='left')
merged2 = pd.merge(df1,df2.drop_duplicates(['X','Y']),left_on=['A','B'],right_on=['X','Y'], how='left')
merged3 = pd.merge(df1,df2.drop_duplicates('X'),left_on=['A'],right_on=['X'], how='left')

df = merged1.fillna(merged2).fillna(merged3)
print(df)
   A  B  C  X  Y  Z   V
0  a  d  d  a  d  d  v1
1  b  e  e  b  e  x  v2
2  c  f  f  c  z  z  v4
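
The same idea can be wrapped in a small helper. This is only a minimal sketch: `cascade_merge` is a hypothetical name (not a pandas function), and it assumes both key lists have the same length and are ordered from most to least important.

```python
from functools import reduce

import pandas as pd


def cascade_merge(left, right, left_keys, right_keys):
    """Left-merge on all keys, then fall back to shorter key prefixes.

    Hypothetical helper: assumes len(left_keys) == len(right_keys).
    """
    merged = []
    for i in range(len(left_keys), 0, -1):
        # Keep one right-hand row per key prefix so the merge cannot
        # duplicate rows of `left`.
        right_unique = right.drop_duplicates(right_keys[:i])
        merged.append(
            left.merge(right_unique, left_on=left_keys[:i],
                       right_on=right_keys[:i], how="left")
        )
    # Fill gaps left by the stricter merges with values from looser ones.
    return reduce(lambda acc, nxt: acc.fillna(nxt), merged)


df1 = pd.DataFrame({"A": ["a", "b", "c"], "B": ["d", "e", "f"], "C": ["d", "e", "f"]})
df2 = pd.DataFrame({"X": ["a", "b", "a", "c"], "Y": ["d", "e", "y", "z"],
                    "Z": ["d", "x", "y", "z"], "V": ["v1", "v2", "v3", "v4"]})

result = cascade_merge(df1, df2, ["A", "B", "C"], ["X", "Y", "Z"])
print(result)
```

Note that fillna works cell by cell, so if df2 itself contains missing values in a row that matched, those cells can end up filled from a looser match.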
1 Comment

can you please have a look here: stackoverflow.com/questions/65561016/… ?
What about this:

import numpy as np

matches = [['A', 'B', 'C'], ['X', 'Y', 'Z']]
df = df1.copy()
for k in range(len(matches[0])):

    #Get your left/right keys right at each iteration :
    left, right = matches
    left = left if k==0 else left[:-k]
    right = right if k==0 else right[:-k]

    #Make sure columns from df2 exist in df
    for col in df2.columns.tolist():
        if col not in df.columns:
            df[col] = np.nan

    #Merge dataframes
    df = df.merge(df2, left_on=left, right_on=right, how='left')

    #Find which row of df's "left" columns (previously initialised) are empty
    ix_left_part = np.all([df[x + "_x"].isnull() for x in right], axis=0)

    #Find which row of df's "right" columns are not empty
    ix_right_part = np.all([df[x + "_y"].notnull() for x in right], axis=0)

    #Combine both to get indexes
    ix = df[ix_left_part & ix_right_part].index

    #Complete values on "left" with those from "right"
    for x in df2.columns.tolist():
        df.loc[ix, x+"_x"] = df.loc[ix, x+'_y']

    #Drop values from "right"
    df.drop([x+"_y" for x  in df2.columns.tolist()], axis=1, inplace=True)

    #Rename "left" columns to stick with original names from df2
    df.rename({x+"_x":x for x  in df2.columns.tolist()}, axis=1, inplace=True)

#drop eventual duplicates
df.drop_duplicates(keep="first", inplace=True)
print(df)

EDIT

I corrected the loop; this should be easier on memory:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({"A":['a','b','c'],"B":['d','e','f'],"C":['d','e','f']})
df2 = pd.DataFrame({"X":['a','b','a','c'],"Y":['d','e','y','z'],"Z":['d','x','y','z'],"V":['v1','v2','v3','v4']})

matches = [['A', 'B', 'C'], ['X', 'Y', 'Z']]
df = df1.copy()

#Make sure columns of df2 exist in df
for col in df2.columns.tolist():
    df[col] = np.nan

for k in range(len(matches[0])):

    #Get your left/right keys right at each iteration :
    left, right = matches
    left = left if k==0 else left[:-k]
    right = right if k==0 else right[:-k]
    
    #Take the keys of the still-unmatched rows and look them up in df2:
    ix = df[df.V.isnull()].index
    temp = (
            df.loc[ix, left]
            .rename(dict(zip(left, right)), axis=1)
            )
    
    temp=temp.merge(df2, on=right, how="inner")
    
    #Merge dataframes
    df = df.merge(temp, left_on=left, right_on=right, how='left')
    
    
    #Combine both to get indexes
    ix = df[(df['V_x'].isnull()) & (df['V_y'].notnull())].index
    

    #Complete values on "left" with those from "right"
    cols_left = [x+'_x' for x in df2.columns.tolist()]
    cols_right = [x+'_y' for x in df2.columns.tolist()]    
    df.loc[ix, cols_left] = df.loc[ix, cols_right].values.tolist()
        
    #Drop values from "right"
    df.drop(cols_right, axis=1, inplace=True)
    
    #Rename "left" columns to stick with original names from df2
    rename = {x+"_x":x for x  in df2.columns.tolist()}
    df.rename(rename, axis=1, inplace=True)

print(df)
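
If memory is a concern on larger inputs, another variant is to retry only the rows that are still unmatched and to deduplicate df2 on the current key prefix before each merge, so no intermediate frame grows beyond the size of df1. The following is a minimal sketch, not benchmarked here: `fallback_merge` is a hypothetical name, and it assumes equal-length key lists, no column names shared between the two frames, and a unique, sortable index on the left frame.

```python
import pandas as pd


def fallback_merge(left, right, left_keys, right_keys):
    """Match on all keys first, then retry unmatched rows on shorter prefixes.

    Hypothetical helper, see the assumptions in the text above.
    """
    out_cols = [*left.columns, *right.columns]
    remaining = left
    pieces = []
    for i in range(len(left_keys), 0, -1):
        if remaining.empty:
            break
        # One candidate per key prefix keeps the merge from multiplying rows.
        candidates = right.drop_duplicates(right_keys[:i])
        step = remaining.merge(candidates, left_on=left_keys[:i],
                               right_on=right_keys[:i], how="left",
                               indicator=True)
        # With unique right-hand keys, a left merge keeps the row count and
        # order of `remaining`, so its index can be restored directly.
        step.index = remaining.index
        matched = step["_merge"] == "both"
        pieces.append(step[matched].drop(columns="_merge"))
        # Only rows without a match move on to the next, looser level.
        remaining = remaining[~matched.to_numpy()]
    # Rows that never matched are kept, with NaN in the right-hand columns.
    pieces = [p for p in pieces + [remaining] if not p.empty]
    if not pieces:
        return left.reindex(columns=out_cols)
    return pd.concat(pieces).sort_index().reindex(columns=out_cols)


df1 = pd.DataFrame({"A": ["a", "b", "c"], "B": ["d", "e", "f"], "C": ["d", "e", "f"]})
df2 = pd.DataFrame({"X": ["a", "b", "a", "c"], "Y": ["d", "e", "y", "z"],
                    "Z": ["d", "x", "y", "z"], "V": ["v1", "v2", "v3", "v4"]})

result = fallback_merge(df1, df2, ["A", "B", "C"], ["X", "Y", "Z"])
print(result)
```

Because each matched row is taken whole from a single merge level, values from different levels are never mixed within one row.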

7 Comments

I think I messed something up around "ix_left_part"; you shouldn't get duplicates at the end... I think I know what it is, but correcting it will depend on whether there is one "target" column (meaning 'V' in your real df2).
There is one target, the V column. I'll test the code and let you know.
Your code works on the dummy data I provided, kudos for that... however, it consumes a lot of memory and crashes my environment when I run it on the actual data :(
What's the shape of your dataframes?
df1 is (10000, 3) and df2 is (137503, 3); there are only 2 columns to match in my actual data, but I was looking for a generic answer, like the one you provided.