
I am trying to approximately match 600,000 individual names (full names) against another database with over 87 million observations (full names)!

My first attempt with the fuzzywuzzy library was way too slow, so I decided to use the fuzzyset module, which is much faster. Assuming I have a computer powerful enough to load the whole dataset in memory, I am doing the following with a test file of 964 observations to be matched against 50,000 observations:

import time
import pandas as pd
from cfuzzyset import cFuzzySet as FuzzySet

df1 = pd.read_csv(file1, delimiter='|')  # test file with 964 observations to match
df2 = pd.read_csv(file2, delimiter='|')  # test file with 50,000 observations to be matched against

a = FuzzySet()  # allocate the FuzzySet object
for row in df2['name']:
    a.add(str(row))  # fill the FuzzySet object with all names from df2

start_time = time.time()  # start recording the time

dicto = {'score': [], 'name': []}  # dictionary where I store the output

for name in df1['f_ofulln']:
    best = a.get(name)[0]  # a.get() returns a list of (score, matched string) tuples
    dicto['score'].append(best[0])
    dicto['name'].append(best[1])

print("--- %s seconds ---" % (time.time() - start_time))

>>> --- 39.68284249305725 seconds ---

Even with this much smaller test (964 observations matched against 50,000), the run time is about 39 seconds.

However, this is far too slow if I want to apply the method to the full dataset.

Does anyone have an idea of how to improve the run time? I think Cython is not an option, since I am already importing the Cython version of the fuzzyset module.

Many thanks,

Adrien

  • This problem is a lot harder than it seems. You will probably have to block (or cluster) the 87M records first before you perform matching; I would not suggest computing the distance from each of the 600k names to every name in the database. The Python library dedupe implements some blocking techniques, although I am not sure it will scale to a dataset of your size. Another possibility is to drop duplicate names in both sets before you perform the fuzzy matching (a minimal sketch of these ideas follows the comments). (Sorry for my vague answer...) Commented Jul 19, 2016 at 23:25
  • Thanks titpat, I will definitely do as much as possible to reduce the size of the datasets to be merged. I also modified the question, since I discovered that the fuzzyset module is much faster. Lastly, I found resources to store the entire dataframe in memory. But even with all that, the final run time would still be measured in weeks! Commented Jul 20, 2016 at 6:22
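
A minimal sketch of the pre-deduplication and blocking ideas mentioned in the comments, assuming pandas dataframes where the names live in a column called 'name' (the file paths and column name are only illustrative):

import pandas as pd

# Hypothetical files; 'name' is an illustrative column name
df1 = pd.read_csv('file1.csv', delimiter='|')
df2 = pd.read_csv('file2.csv', delimiter='|')

# Dropping exact duplicates shrinks the number of fuzzy comparisons
df1 = df1.drop_duplicates(subset='name')
df2 = df2.drop_duplicates(subset='name')

# A crude blocking key: only compare names that share the same first letter
df1['block'] = df1['name'].str[0].str.upper()
df2['block'] = df2['name'].str[0].str.upper()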

1 Answer


So I am going to answer my own question since I found a way that is pretty fast.

I saved both databases in HDF5 format using the pandas HDFStore and to_hdf methods, splitting the data into one dataframe per first letter of the last name. Then I wrote a function that finds the closest match, based on the python-Levenshtein module (very fast, since it is written in C).
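
A minimal sketch of what the per-letter HDF5 storage could look like (the store path and the 'lastname' column name are just illustrative):

import string
import pandas as pd

# df2 is the big dataframe; 'lastname' is an illustrative column name
for letter in string.ascii_uppercase:
    block = df2[df2['lastname'].str.upper().str.startswith(letter)]
    block.to_hdf('name_blocks.h5', key='letter_' + letter, mode='a')

# Each batch job then only reads the block it needs, e.g. for the letter A:
block_A = pd.read_hdf('name_blocks.h5', key='letter_A')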

And last, I submitted 26 batch jobs at once, one for each letter of the last name. That means I only match people whose last names start with the same initial.

Note that I also programmed the function to only consider candidates whose birth year differs by no more than one year.
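
As a hedged sketch, the birth-year restriction can be applied as a simple filter on each block before calling the closest-match helper shown in the EDIT below (the 'fullname' and 'birthyear' column names are illustrative):

from Levenshtein import ratio

def match_block(left_block, right_block):
    # Compare each left record only against right records born within one year of it
    results = []
    for row in left_block.itertuples():
        candidates = right_block[(right_block['birthyear'] - row.birthyear).abs() <= 1]
        best_name, best_score = get_closest_match(row.fullname, candidates['fullname'], ratio)
        results.append((row.Index, best_name, best_score))
    return results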

EDIT: Since it was requested, I am providing a summary of my function below. Unfortunately, the main function that merges the two dataframes is too long to post here.

# Needed imports:
from Levenshtein import ratio, jaro_winkler
import pandas as pd

# Function that gets the closest match of a word in a big list:

def get_closest_match(x, list_strings, fun):
    # fun: the matching method, e.g. ratio or jaro_winkler (cf. the python-Levenshtein module)
    best_match = None
    highest_ratio = 0
    for current_string in list_strings.values.tolist():
        current_score = fun(x, current_string)
        if current_score > highest_ratio:
            highest_ratio = current_score
            best_match = current_string
        if highest_ratio == 1:  # perfect match found, no need to keep looking
            break
    return (best_match, highest_ratio)

# The function that matches the two dataframes (only the idea, since the full code is too long to post):
dicto = {'Index': [], 'score': [], ...}

def LevRatioMerge(df1, df2, fun, colname, required=[], approx=[], eps=[]):
    # Basically loop over df1 with:
    for name in df1.itertuples():
        result = get_closest_match(name[YourColnumber], df2[YourColname], fun)
        dicto['score'].append(result[1])
        dicto['Index'].append(name[0])
        ...

This is the idea. I hope it is enough inspiration for your own task.
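
As a quick usage example of get_closest_match on its own (the names here are made up):

import pandas as pd
from Levenshtein import ratio

candidates = pd.Series(['John Smith', 'Jon Smyth', 'Jane Smith'])
print(get_closest_match('Jon Smith', candidates, ratio))
# expected output: ('John Smith', <score close to 1>)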


3 Comments

  • This was ages ago, but I am building on the same logic. May I know how you wrote your python-Levenshtein function? I am using process from FuzzyWuzzy and it seems too slow for my needs.
  • I edited my answer; unfortunately my function is very long and really tailored to my problem. I hope this small summary is enough for you to use the python-Levenshtein module for your own problem.
  • Hi. I am trying to implement the same logic for address matching, keeping "city_name" as my blocking or indexing key. Is it possible to see the function you have written? That would be helpful for me.
