I have two dataframes and want to join them on three fields: A, B, and C. A and B are numeric and must match exactly in the join/merge, but C is a string and I only need at least an 80% match (similarity). For example, if A and B have the same values in both dataframes and C is abcde in the first dataframe and abcdf in the second, I still want that record in my result. How can I implement this in Python?

2 Answers

2

You can use fuzzywuzzy:

import pandas as pd
from fuzzywuzzy import fuzz

df1 = pd.DataFrame({'A': [1, 3, 2], 'B': [2, 2, 3], 'C': ['aad', 'aac', 'aad']})
df2 = pd.DataFrame({'A': [1, 2, 2], 'B': [2, 2, 3], 'C': ['aad', 'aab', 'acd']})

# Merge on the exact-match keys only; C becomes C_x (from df1) and C_y (from df2)
mergedf1 = df1.merge(df2, on=['A', 'B'])

# Score each pair of C values (0-100); filter the frame by whatever threshold you need
mergedf1['ratio'] = [fuzz.ratio(x, y) for x, y in zip(mergedf1['C_x'], mergedf1['C_y'])]
mergedf1
Out[265]: 
   A  B  C_x  C_y  ratio
0  1  2  aad  aad    100
1  2  3  aad  acd     67
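
To apply the 80% cutoff from the question, the score column can be filtered directly. This is only a sketch: the frame below copies the output printed above (including the ratio values) so it runs without fuzzywuzzy installed, and the names THRESHOLD and matched are illustrative, not part of the answer.

```python
import pandas as pd

# Merged frame as printed above; the ratio column is copied from that output
# rather than recomputed, so fuzzywuzzy is not needed to run this sketch.
mergedf1 = pd.DataFrame({
    'A': [1, 2],
    'B': [2, 3],
    'C_x': ['aad', 'aad'],
    'C_y': ['aad', 'acd'],
    'ratio': [100, 67],
})

THRESHOLD = 80  # fuzz.ratio scores run from 0 to 100, so 80 means 80%

# Keep only the pairs whose C values are at least 80% similar.
matched = mergedf1[mergedf1['ratio'] >= THRESHOLD].reset_index(drop=True)
print(matched)
```

With the sample data only the aad/aad row survives, since the aad/acd pair scores 67.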

0

I would probably merge first on only A and B, then filter out any rows that have low similarity on the C column, so something like:

result = df1.merge(df2, on=['A', 'B'])

# sim is assumed to be a similarity function you define yourself, returning a score between 0 and 1
idx = result.apply(lambda x: sim(x['C_x'], x['C_y']) >= 0.8, axis=1)
result = result[idx]
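
The sim function is left undefined here. One possible choice, which is my assumption and not something the question specifies, is difflib.SequenceMatcher from the standard library, whose ratio() returns a value between 0 and 1. A minimal sketch using the abcde/abcdf pair from the question (the sample frames are made up for illustration):

```python
from difflib import SequenceMatcher

import pandas as pd


def sim(a, b):
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical sample data: A and B match in both rows, C differs slightly.
df1 = pd.DataFrame({'A': [1, 2], 'B': [2, 3], 'C': ['abcde', 'aad']})
df2 = pd.DataFrame({'A': [1, 2], 'B': [2, 3], 'C': ['abcdf', 'acd']})

# Exact match on A and B, then keep rows whose C values are similar enough.
result = df1.merge(df2, on=['A', 'B'])
keep = result.apply(lambda row: sim(row['C_x'], row['C_y']) >= 0.8, axis=1)
result = result[keep]
print(result)
```

Here abcde/abcdf shares four of five characters in order and scores 0.8, so it is kept, while aad/acd falls below the cutoff. Any other string metric with the same 0-to-1 range could be swapped in for sim.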

Hope it helps!

3 Comments

Is sim a new function he needs to write?
@RafaelC I spent some time trying to find the function sim... LOL
@Wen Oh, that's however the OP wants to calculate the similarity. Edited, thanks!
