
I am trying to approximately match 600,000 individual names (full names) against another database with over 87 million observations (full names)!

My first attempt with the fuzzywuzzy library was way too slow, so I decided to use the fuzzyset module, which is much faster. Assuming I have a computer powerful enough to load the whole dataset in memory, I am doing the following with a test file of 964 observations to be matched against 50,000 observations:

import time
import pandas as pd
from cfuzzyset import cFuzzySet as FuzzySet

df1 = pd.read_csv(file1, delimiter='|')  # test file with 964 observations to match
df2 = pd.read_csv(file2, delimiter='|')  # test file with 50,000 observations to be matched against

a = FuzzySet()  # allocate the FuzzySet object
for row in df2['name']:
    a.add(str(row))  # fill the FuzzySet object with all names from df2

start_time = time.time()  # start recording the time

dicto = {'score': [], 'name': []}  # dictionary where I store the output

for name in df1['f_ofulln']:
    best = a.get(name)[0]  # a.get() returns a list of (score, matched string) tuples
    dicto['score'].append(best[0])
    dicto['name'].append(best[1])

print("--- %s seconds ---" % (time.time() - start_time))

>>> --- 39.68284249305725 seconds ---

Even with this much smaller test (964 observations matched against 50,000), the run time is about 39 seconds.

However, this is far too slow if I want to apply the method to the full dataset.

Does anyone have an idea of how to improve the run time? I think Cython is not an option, since I am already importing the Cython version of the fuzzyset module.

Many thanks,

Adrien

  • This problem is a lot harder than it seems. You will probably have to block (or cluster) the 87M records first before you perform matching; I would not suggest computing the distance from each of the 600k names to every name in the database. The Python library dedupe implements some blocking techniques, although I am not sure it will scale to a dataset of your size. Another possibility is to drop duplicate names in both sets before you perform the fuzzy matching (a minimal sketch of these ideas follows the comments). (Sorry for my vague answer...) Commented Jul 19, 2016 at 23:25
  • Thanks titpat, I will definitely do as much as possible to reduce the size of the datasets to be merged. I also modified the question, since I discovered that the fuzzyset module is much faster. Lastly, I found resources to store the entire dataframe in memory. But even with all that, the final run time would still be measured in weeks! Commented Jul 20, 2016 at 6:22
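
A minimal sketch of the pre-deduplication and blocking ideas mentioned in the comments, assuming pandas dataframes where the names live in a column called 'name' (the file paths and column name are only illustrative):

import pandas as pd

# Hypothetical files; 'name' is an illustrative column name
df1 = pd.read_csv('file1.csv', delimiter='|')
df2 = pd.read_csv('file2.csv', delimiter='|')

# Dropping exact duplicates shrinks the number of fuzzy comparisons
df1 = df1.drop_duplicates(subset='name')
df2 = df2.drop_duplicates(subset='name')

# A crude blocking key: only compare names that share the same first letter
df1['block'] = df1['name'].str[0].str.upper()
df2['block'] = df2['name'].str[0].str.upper()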

1 Answer


So I am going to answer my own question since I found a way that is pretty fast.

I saved both databases in HDF5 format using the pandas HDFStore and to_hdf methods, splitting the data into one dataframe per first letter of the last name. Then I wrote a function that finds the closest match, based on the python-Levenshtein module (very fast, since it is written in C).
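
A minimal sketch of what the per-letter HDF5 storage could look like (the store path and the 'lastname' column name are just illustrative):

import string
import pandas as pd

# df2 is the big dataframe; 'lastname' is an illustrative column name
for letter in string.ascii_uppercase:
    block = df2[df2['lastname'].str.upper().str.startswith(letter)]
    block.to_hdf('name_blocks.h5', key='letter_' + letter, mode='a')

# Each batch job then only reads the block it needs, e.g. for the letter A:
block_A = pd.read_hdf('name_blocks.h5', key='letter_A')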

And last, I submitted 26 batch jobs at once, one for each letter of the last name. That means I only match people whose last names start with the same initial.

Note that I also programmed the function to only consider candidates whose birth year differs by no more than one year.
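
As a hedged sketch, the birth-year restriction can be applied as a simple filter on each block before calling the closest-match helper shown in the EDIT below (the 'fullname' and 'birthyear' column names are illustrative):

from Levenshtein import ratio

def match_block(left_block, right_block):
    # Compare each left record only against right records born within one year of it
    results = []
    for row in left_block.itertuples():
        candidates = right_block[(right_block['birthyear'] - row.birthyear).abs() <= 1]
        best_name, best_score = get_closest_match(row.fullname, candidates['fullname'], ratio)
        results.append((row.Index, best_name, best_score))
    return results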

EDIT: Since it was requested, I am providing a summary of my function below. Unfortunately, the main function that merges the two dataframes is too long to post here.

# Needed imports:
from Levenshtein import ratio, jaro_winkler
import pandas as pd

# Function that gets the closest match of a word in a big list:

def get_closest_match(x, list_strings, fun):
    # fun: the matching method, e.g. ratio or jaro_winkler (cf. the python-Levenshtein module)
    best_match = None
    highest_ratio = 0
    for current_string in list_strings.values.tolist():
        current_score = fun(x, current_string)
        if current_score > highest_ratio:
            highest_ratio = current_score
            best_match = current_string
        if highest_ratio == 1:  # perfect match found, no need to keep looking
            break
    return (best_match, highest_ratio)

# The function that matches the two dataframes (only the idea, since the full code is too long to post):
dicto = {'Index': [], 'score': [], ...}

def LevRatioMerge(df1, df2, fun, colname, required=[], approx=[], eps=[]):
    # Basically loop over df1 with:
    for name in df1.itertuples():
        result = get_closest_match(name[YourColnumber], df2[YourColname], fun)
        dicto['score'].append(result[1])
        dicto['Index'].append(name[0])
        ...

This is the idea. I hope it is enough inspiration for your own task.
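
As a quick usage example of get_closest_match on its own (the names here are made up):

import pandas as pd
from Levenshtein import ratio

candidates = pd.Series(['John Smith', 'Jon Smyth', 'Jane Smith'])
print(get_closest_match('Jon Smith', candidates, ratio))
# expected output: ('John Smith', <score close to 1>)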


3 Comments

  • This was ages ago, but I am building on the same logic. May I know how you wrote your python-Levenshtein function? I am using process from FuzzyWuzzy and it seems too slow for my needs.
  • I edited my answer; unfortunately my function is very long and really tailored to my problem. I hope this small summary is enough for you to use the python-Levenshtein module for your own problem.
  • Hi. I am trying to implement the same logic for address matching, keeping "city_name" as my blocking or indexing key. Is it possible to see the function you have written? That would be helpful for me.
