I am trying to approximately match 600,000 individuals' names (full name) against another database that has over 87 million observations (full name)!
My first attempt with the fuzzywuzzy library was way too slow, so I decided to use the fuzzyset module, which is much faster. Assuming I have a computer powerful enough to load the whole dataset in memory, I am doing the following with a test file of 964 observations to be matched against 50,000 observations:
import time
import pandas as pd
from cfuzzyset import cFuzzySet as FuzzySet

df1 = pd.read_csv(file1, delimiter='|')  # test file with 964 observations to match
df2 = pd.read_csv(file2, delimiter='|')  # test file with 50,000 observations to be matched against

a = FuzzySet()          # allocate the FuzzySet object
for row in df2['name']:
    a.add(str(row))     # fill the FuzzySet object with all names from df2

start_time = time.time()            # start recording the time
dicto = {'index': [], 'name': []}   # dictionary where I store the output
for names in df1['f_ofulln']:
    best = a.get(names)[0]          # best match as a (score, matched name) tuple
    dicto['index'].append(best[0])
    dicto['name'].append(best[1])

print("--- %s seconds ---" % (time.time() - start_time))
>>> --- 39.68284249305725 seconds ---
Even on this much smaller test (964 names matched against 50,000), the run took almost 40 seconds, which is far too slow to apply to the full dataset.
Does anyone have an idea of how to improve the run time? I don't think Cython is an option, since I am already importing the Cython version of the fuzzyset module (cfuzzyset).
Many thanks,
Adrien
dedupe has an implementation of blocking techniques. However, I'm not sure whether it will scale to a dataset of your size. Another possibility is to drop duplicate names in both sets before you perform the fuzzy matching. (Sorry for my vague answer...)
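For what it's worth, here is a rough sketch of both ideas using plain pandas and cfuzzyset rather than the dedupe library itself. It reuses df1/df2 and the column names from your question, and blocking on the first letter of the name is just one illustrative choice of key; anything cheap that both records of a true match share would work.

import pandas as pd
from cfuzzyset import cFuzzySet as FuzzySet

# Drop exact duplicates first so each distinct name is indexed/queried only once.
queries = df1['f_ofulln'].dropna().drop_duplicates()
targets = df2['name'].dropna().drop_duplicates()

# Crude blocking key: compare only names that share the same first letter.
def block_key(name):
    return str(name).strip()[:1].upper()

# Build one (much smaller) FuzzySet per block of the target names.
blocked_sets = {}
for name in targets:
    blocked_sets.setdefault(block_key(name), FuzzySet()).add(str(name))

# Query each name only against the FuzzySet of its own block.
results = []
for name in queries:
    fs = blocked_sets.get(block_key(name))
    match = fs.get(str(name)) if fs is not None else None  # None if no candidate found
    if match:
        score, matched_name = match[0]
        results.append((name, matched_name, score))

Whether this helps depends on how evenly the blocking key splits your 87 million names; a key that puts most records in one block gains you little, while a too-specific key risks missing true matches that disagree on the key.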