How would I join two dataframes based on a partial string match?

Question

I have two dataframes and want to join them on three fields: A, B, and C. A and B are numeric and must match exactly in the join/merge, but C is a string and I want at least an 80% match (similarity). For example, if A and B have the same values in both dataframes and the value of C is abcde in the first dataframe and abcdf in the second, I still want this record in my result. How can I implement this in Python?

BENY · Accepted Answer · 2018-07-27 19:48:13Z

2

You can use fuzzywuzzy:

import pandas as pd
from fuzzywuzzy import fuzz

df1 = pd.DataFrame({'A': [1, 3, 2], 'B': [2, 2, 3], 'C': ['aad', 'aac', 'aad']})
df2 = pd.DataFrame({'A': [1, 2, 2], 'B': [2, 2, 3], 'C': ['aad', 'aab', 'acd']})

# Exact join on A and B; the two C columns come back as C_x and C_y
mergedf1 = df1.merge(df2, on=['A', 'B'])

# Score each pair of C values (0 to 100); you can then cut the dataframe at your own limit
mergedf1['ratio'] = [fuzz.ratio(x, y) for x, y in zip(mergedf1['C_x'], mergedf1['C_y'])]
mergedf1
Out[265]: 
   A  B  C_x  C_y  ratio
0  1  2  aad  aad    100
1  2  3  aad  acd     67

answered Jul 27, 2018 at 19:48


Bubble Bubble Bubble Gut · Accepted Answer · 2018-07-27 19:51:19Z

0

I would probably merge first on only A and B, then filter out any rows that have low similarity on the C column, so something like:

result = df1.merge(df2, on=['A', 'B'])

# assuming sim is the similarity function that you wrote; it takes two strings and returns a score between 0 and 1
idx = result.apply(lambda row: sim(row['C_x'], row['C_y']) >= 0.8, axis=1)
result = result[idx]

Hope it helps!
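The answer leaves `sim` up to you. If you don't have one yet, a minimal sketch using the standard library's `difflib.SequenceMatcher` could look like the following. This is just one possible similarity measure, not necessarily the one the OP has in mind.

```python
from difflib import SequenceMatcher


def sim(a, b):
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    # ratio() is 2 * (matching characters) / (total characters in both strings)
    return SequenceMatcher(None, a, b).ratio()


# The example from the question: 4 matching characters out of 10 in total.
print(sim('abcde', 'abcdf'))
print(sim('aad', 'acd'))
```

With this definition the abcde/abcdf pair scores 0.8, so it passes the `>= 0.8` filter. Note the scale is 0 to 1 here, whereas `fuzz.ratio` from fuzzywuzzy uses 0 to 100.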

edited Jul 27, 2018 at 19:51

answered Jul 27, 2018 at 19:44


3 Comments

BENY Over a year ago

Is sim a new function that he needs to write?

BENY Over a year ago

@RafaelC I spent some time trying to find the function sim ... LOL

Bubble Bubble Bubble Gut Over a year ago

@Wen Oh that's how the OP wants to calculate the similarity, edited, Thanks!
