
I have two pandas dataframes that both contain company names, and I want to merge them on company name using a fuzzy match. The problem is that one dataframe has 5m rows and the other has about 10k rows, so my fuzzy match takes forever to run. Is there a more efficient way to do this?

This is the code I'm using right now:

# Assumed import: process.extract as provided by thefuzz (fuzzywuzzy exposes the same call)
from thefuzz import process

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: how close the matches should be to return a match, based on Levenshtein distance
    :param limit: the number of matches that will be returned, sorted high to low
    :return: dataframe with both keys and matches
    """
    # Candidate names from the right table
    s = df_2[key2].tolist()

    # For every left-hand name, collect the `limit` best-scoring candidates
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m

    # Keep only candidates at or above the threshold, joined into one string
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    return df_1

And these are some sample data from df1 and df2.

df1

df1_ID Company Name
AB0091 Apple
AC0092 Microsoft

df2

df2_ID Company Name
F001ABC Appl
E002ABG The microst

As you can see, the company names may contain typos and other differences between the two dataframes, and there is no other column I can use for the merge, which is why I need a fuzzy match on company names. The end goal is to match these two large dataframes efficiently by company name.

Thank you!

  • What approaches have you already considered? Commented Apr 25, 2024 at 21:23
  • You might look at encoding the company names using something like 'soundex' to group like names before using the fuzzy logic to select appropriate pairs. Commented Apr 25, 2024 at 21:36
  • @itprorh66 I have tried the function I mentioned above, and also the difflib functions. I wanted to use the fuzzy lookup in Excel, but one of my dataframes (> 5m rows) exceeds Excel's row limit, so I don't know what else I can do. Commented Apr 25, 2024 at 21:36
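
The grouping idea from the soundex comment can be sketched roughly as follows. This is a minimal illustration using only the standard library, not code from the thread: the helper names (`soundex`, `block_key`, `blocked_matches`), the choice to drop a leading "the", and the 0.6 threshold on `difflib`'s 0 to 1 scale are all assumptions. Names are compared only with candidates that share a Soundex code, which cuts the number of comparisons but will miss pairs whose first letters differ.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# American Soundex digit for each coded consonant; vowels, h, w and y carry no code
_CODES = {ch: digit
          for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                 ("l", "4"), ("mn", "5"), ("r", "6"))
          for ch in letters}

def soundex(word):
    """Return the 4-character Soundex code of a word ('' if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        if ch not in "hw":  # h and w do not separate letters with the same code
            prev = code
    return (encoded + "000")[:4]

def block_key(name):
    """Hypothetical blocking key: Soundex of the first word, ignoring a leading 'the'."""
    words = name.lower().split()
    if len(words) > 1 and words[0] == "the":
        words = words[1:]
    return soundex(words[0]) if words else ""

def blocked_matches(left_names, right_names, threshold=0.6):
    """Map each distinct left name to its best right name within the same block, or None."""
    blocks = defaultdict(list)
    for name in right_names:
        blocks[block_key(name)].append(name)
    results = {}
    for name in set(left_names):
        best, best_score = None, 0.0
        for candidate in blocks.get(block_key(name), []):
            score = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
            if score > best_score:
                best, best_score = candidate, score
        results[name] = best if best_score >= threshold else None
    return results

matches = blocked_matches(["Apple", "Microsoft"], ["Appl", "The microst"])
print(matches)
```

With dataframes, the two lists could come from `df1['Company Name'].tolist()` and `df2['Company Name'].tolist()`, and the resulting dictionary could be mapped back onto `df1` before an ordinary merge.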

1 Answer


One possible approach is using rapidfuzz and the process.extractOne() method in the following manner:

import pandas as pd
from rapidfuzz import process

data1 = {
    'df1_ID': ['AB0091', 'AC0092'],
    'Company Name': ['Apple', 'Microsoft']
}
df1 = pd.DataFrame(data1)

data2 = {
    'df2_ID': ['F001ABC', 'E002ABG'],
    'Company Name': ['Appl', 'The microst']
}
df2 = pd.DataFrame(data2)

def fuzzy_match(row, df2, key, threshold=70):
    # Best-scoring df2 name for this row, or None if nothing reaches the threshold
    best_match = process.extractOne(row[key], df2[key], score_cutoff=threshold)
    if best_match:
        # Look up the ID of the matched df2 row by its company name
        matched_id = df2.loc[df2[key] == best_match[0], 'df2_ID'].values[0]
        return pd.Series([best_match[0], matched_id, best_match[1]])
    return pd.Series([None, None, None])

df1[['Matched Company', 'Matched df2_ID', 'Match Score']] = df1.apply(fuzzy_match, axis=1, df2=df2, key='Company Name')

print(df1)

which will return:

   df1_ID Company Name Matched Company Matched df2_ID  Match Score
0  AB0091        Apple            Appl        F001ABC    88.888889
1  AC0092    Microsoft            None           None          NaN

The output depends entirely on the threshold you choose. A threshold of 90 matched nothing, while 70 matched Apple (score 88.9). For a match on Microsoft, you would need to go lower (50),

   df1_ID Company Name Matched Company Matched df2_ID  Match Score
0  AB0091        Apple            Appl        F001ABC    88.888889
1  AC0092    Microsoft     The microst        E002ABG    60.000000

but you may get unreasonable matches on other things.
