performance issues applying double lambda functions

Question

I have two dataframes, df1 and df2, each of which contains a column of names. I compare every name in df1 with every name in df2. The match has to be approximate, so I am using fuzzywuzzy's token_sort_ratio to get a comparison score.

However, this method is very slow and df2 keeps growing; it already takes more than half an hour (4k x 2k rows). Is there a way to speed up the process?

My current implementation:

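from fuzzywuzzy import fuzz

def match(df2, name):
    # Score every name in df2 against the given name, then return the best-scoring row.
    df2['score'] = df2['name'].map(lambda x: fuzz.token_sort_ratio(x, name))
    return df2.loc[df2['score'].idxmax()]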

df1['result']=df1['name'].map(lambda x: match(df2,x))

E. Zeytinci · Accepted Answer · 2019-10-18 13:32:15Z


You can build every pair of names with a cross join on a constant key and then score each pair:

from fuzzywuzzy import fuzz

def similarity(name1, name2):
    return fuzz.token_sort_ratio(name1, name2)

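# A constant key in both frames turns the merge into a cross join (every df1 row paired with every df2 row).
df1['key'] = 1
df2['key'] = 1
# Both frames have a 'name' column, so the merge suffixes them as name_x (df1) and name_y (df2).
merged = df1.merge(df2, on='key')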

merged['name_score'] = merged[['name_x', 'name_y']] \
    .apply(lambda row: similarity(row['name_x'], row['name_y']), axis=1)

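Or, with the same cross join, map the scoring function over the two name columns directly instead of using a row-wise apply: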

from fuzzywuzzy import fuzz

def similarity(name1, name2):
    return fuzz.token_sort_ratio(name1, name2)

df1['key'] = 1
df2['key'] = 1
merged = df1.merge(df2, on='key')

scores = list(map(similarity, merged['name_x'], merged['name_y']))
merged['name_score'] = scores
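The question asks for the best match per df1 name, while the snippets above stop at scoring every pair. A minimal sketch of that last step follows, assuming the merged frame has the name_x, name_y and name_score columns shown above. The sample names are made up, and token_sort_score is a hypothetical difflib-based stand-in for fuzz.token_sort_ratio, used only so the example runs without fuzzywuzzy installed; its scores will not match the library's exactly.

```python
import difflib

import pandas as pd


def token_sort_score(a, b):
    # Hypothetical stand-in for fuzz.token_sort_ratio:
    # lowercase, sort the tokens, then compare the rejoined strings.
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    ratio = difflib.SequenceMatcher(None, a_sorted, b_sorted).ratio()
    return round(100 * ratio)


# Made-up sample data; the real frames come from the question.
df1 = pd.DataFrame({"name": ["John Smith", "Acme Corp"]})
df2 = pd.DataFrame({"name": ["Smith John", "Acme Corporation", "Jane Doe"]})

# Cross join on a constant key, as in the answer; assign() leaves df1/df2 unmodified.
merged = df1.assign(key=1).merge(df2.assign(key=1), on="key").drop(columns="key")
merged["name_score"] = list(map(token_sort_score, merged["name_x"], merged["name_y"]))

# For each df1 name, keep the row whose score is highest.
best_idx = merged.groupby("name_x")["name_score"].idxmax()
best = merged.loc[best_idx].reset_index(drop=True)
```

Grouping by name_x collapses duplicate names in df1 into a single result row; if df1 can contain duplicates, grouping by a unique row identifier instead would probably be safer.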

edited Oct 18, 2019 at 13:32

answered Oct 18, 2019 at 13:22
