Welcome to Data Science Stack Exchange!
A good approach to improving fuzzy matching between two data sources involves thorough pre-processing to reduce inconsistencies in name formats. The following steps can enhance matching performance.
Pre-processing Suggestions
- Convert text to lower case – For consistency across both data sources.
- Remove punctuation – Variations in punctuation can lead to mismatches.
- Remove common suffixes (for company names) – Terms like "Ltd," "Inc," or "LLC" often do not impact name identity and can be safely removed.
- Filter out stop words – Terms such as "Dr," "Mr," or "Corp" add noise and should be removed.
- Use phonetic matching – Algorithms like Soundex or Metaphone handle different spellings with similar pronunciations (e.g., Jon/John).
- Apply spelling normalisation – Removing vowels (e.g., "Leonard" → "lnrd") and simplifying double consonants (e.g., "Allen" → "Alen") can reduce spelling variability. A short code sketch of these steps follows this list.
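A minimal sketch of these pre-processing steps, assuming Python and the jellyfish library for the phonetic step (the suffix and stop-word lists below are illustrative and should be adapted to your data):

```python
import re
import jellyfish  # only needed for the phonetic step

# Illustrative lists – extend these to match the conventions in your data
COMPANY_SUFFIXES = {"ltd", "inc", "llc", "corp", "co"}
STOP_WORDS = {"dr", "mr", "mrs", "ms"}

def normalise_name(name: str) -> str:
    """Lower-case, strip punctuation, and drop suffixes/stop words."""
    name = name.lower()
    name = re.sub(r"[^\w\s]", " ", name)  # remove punctuation
    tokens = [t for t in name.split()
              if t not in COMPANY_SUFFIXES and t not in STOP_WORDS]
    return " ".join(tokens)

def phonetic_key(name: str) -> str:
    """Metaphone encoding of each remaining token."""
    return " ".join(jellyfish.metaphone(t) for t in normalise_name(name).split())

print(normalise_name("Dr. John Smith, Ltd."))  # -> "john smith"
print(phonetic_key("Jon Smith"))               # -> same key as for "John Smith"
```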
Optimising the Similarity Threshold
Choosing an appropriate threshold for string similarity is important. If the goal is to use the model to match new/unseen data in the future, simply adjusting the threshold until the current matches look best risks overfitting, and the chosen threshold may not generalise to unseen data. Cross-validation (CV) can help determine this threshold by balancing precision and recall. With annotated name pairs, CV can be used to select the threshold that maximises the F1 score, which is defined as:
$$
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$
This score provides a balanced measure of performance, but the emphasis on precision vs. recall may vary depending on your application's priorities. In practice, the "cost" of a false positive can differ greatly from that of a false negative (e.g., think of cancer screening in oncology, or communicable-disease screening in public health). However, CV requires a labelled dataset with both positive and negative matches, which may not be feasible for smaller datasets.
If there is no expectation of applying the model to future/new data, then CV is probably not needed.
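To illustrate how such a threshold could be tuned, here is a rough sketch assuming a small labelled set of name pairs (entirely made up here) and rapidfuzz for scoring; each fold selects the threshold that maximises F1 on the training portion and evaluates it on the held-out portion:

```python
import numpy as np
from rapidfuzz import fuzz
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

# Hypothetical labelled pairs: (name_a, name_b, 1 = true match, 0 = non-match)
pairs = [("Jon Smith", "John Smith", 1),
         ("Acme Ltd", "Acme Inc", 1),
         ("Acme Ltd", "Apex LLC", 0),
         ("Leonard Allen", "Lenard Alen", 1),
         ("Leonard Allen", "Leo Nardi", 0),
         ("Beta Corp", "Beta Corporation", 1)]

scores = np.array([fuzz.token_sort_ratio(a, b) for a, b, _ in pairs])
labels = np.array([y for _, _, y in pairs])

thresholds = np.arange(50, 100)
kf = KFold(n_splits=3, shuffle=True, random_state=0)
fold_f1 = []
for train_idx, test_idx in kf.split(scores):
    # Choose the threshold that maximises F1 on the training fold ...
    best_t = max(thresholds, key=lambda t: f1_score(
        labels[train_idx], (scores[train_idx] >= t).astype(int), zero_division=0))
    # ... then measure F1 on the held-out fold with that threshold
    fold_f1.append(f1_score(
        labels[test_idx], (scores[test_idx] >= best_t).astype(int), zero_division=0))

print("Mean held-out F1:", np.mean(fold_f1))
```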
Alternative Approaches and Tools
In addition to TheFuzz, consider the following:
- Probabilistic String Matching: This is similar to fuzzy matching and can be considered an extension of it, incorporating multiple features (e.g., names, dates, addresses) to compute an overall match probability. It is not applicable in your case unless you have other data about the names.
- Cosine Similarity: If the names consist of multiple words, cosine similarity (where the names are vectorised and the angle between the vectors is computed) might work well; see the sketch after this list.
- Libraries: rapidfuzz offers faster, optimised string matching with similar functionality to TheFuzz.
- Large Language Model (LLM)-based matching: I have never tried this, but I think it could be very useful and is certainly worth trying out. LLMs may enhance name matching by capturing contextual similarities; a possible starting point is sketched after this list.
- Hybrid approaches: Combining the rule-based methods outlined above with LLMs could improve both precision and recall.
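For the cosine-similarity idea, a common sketch is to vectorise names with character n-grams via scikit-learn's TfidfVectorizer and compare the resulting vectors (the n-gram range below is an assumption you would tune):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names_a = ["Leonard Allen Ltd", "Beta Corporation"]
names_b = ["Lenard Alen LLC", "Beta Corp", "Gamma GmbH"]

# Character 2-3-grams are fairly robust to small spelling differences
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
tfidf = vec.fit_transform(names_a + names_b)

sim = cosine_similarity(tfidf[:len(names_a)], tfidf[len(names_a):])
print(sim)  # sim[i, j] = similarity between names_a[i] and names_b[j]
```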
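If you want to experiment with the LLM/embedding idea, one minimal sketch (assuming the sentence-transformers library; the model name is just an example) is to embed the names and compare the embeddings with cosine similarity:

```python
from sentence_transformers import SentenceTransformer, util

# Example model name – any pre-trained sentence-embedding model could be substituted
model = SentenceTransformer("all-MiniLM-L6-v2")

names_a = ["Jonathan Smith", "Acme Ltd"]
names_b = ["Jon Smith", "Acme Incorporated"]

emb_a = model.encode(names_a, convert_to_tensor=True)
emb_b = model.encode(names_b, convert_to_tensor=True)

# Cosine similarity between every pair of names
print(util.cos_sim(emb_a, emb_b))
```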