$\begingroup$

As part of my internship, I am working on a project where I need to process two Excel files:

File 1 contains names and numbers. File 2 contains names and an empty column for amounts.

The goal is to match the names from File 1 with those in File 2 and correctly fill in the corresponding amounts. However, the names may have slight variations in spelling or formatting between the two files (e.g., typos, different spacing, or abbreviations).

To handle this, I am currently using TheFuzz (FuzzyWuzzy) in Python for approximate string matching. I have noticed that choosing the right similarity threshold is crucial—setting it too low results in incorrect matches, while setting it too high may cause valid matches to be missed.

I am also considering using an LLM API to improve matching accuracy.

Are there other libraries or approaches in Python that could improve fuzzy matching for this kind of task?

Has anyone used LLMs for this type of problem, and if so, how effective was it?

Thanks for your help

I used TheFuzz (FuzzyWuzzy) in Python to perform fuzzy matching between the names in the two Excel files. I experimented with different similarity thresholds to balance precision and recall.

$\endgroup$
  • $\begingroup$ How many names are in each file? $\endgroup$ Commented Feb 11 at 13:44

1 Answer

$\begingroup$

Welcome to Data Science Stack Exchange!

A good approach to improving fuzzy matching between two data sources involves thorough pre-processing to reduce inconsistencies in name formats. The following steps can enhance matching performance.

Pre-processing Suggestions

  1. Convert text to lower case – For consistency across both data sources.
  2. Remove punctuation – Variations in punctuation can lead to mismatches.
  3. Remove common suffixes (for company names) – Terms like "Ltd," "Inc," or "LLC" often do not impact name identity and can be safely removed.
  4. Filter out stop words – Terms such as "Dr," "Mr," or "Corp" add noise and should be removed.
  5. Use phonetic matching – Algorithms like Soundex or Metaphone handle different spellings with similar pronunciations (e.g., Jon/John).
  6. Apply spelling normalisation – Removing vowels (e.g., "Leonard" → "lnrd") and simplifying double consonants (e.g., "Allen" → "Alen") can reduce spelling variability.
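
As a rough illustration, steps 1 to 4 and 6 could be sketched in plain Python as below. The function names `normalise` and `skeleton` are hypothetical, the token lists are only the examples mentioned above and would need extending for real data, and replacing punctuation with spaces (rather than deleting it) is an assumption that suits "Smith,John" better than "O'Brien". Phonetic matching (step 5) is left out because it needs a dedicated algorithm or library.

```python
import re
import string

# Illustrative token lists taken from the examples above; extend for your data.
SUFFIXES = {"ltd", "inc", "llc"}
STOP_WORDS = {"dr", "mr", "corp"}
DROP = SUFFIXES | STOP_WORDS

# Map every punctuation character to a space (an assumption, see lead-in).
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalise(name: str) -> str:
    """Lower-case, strip punctuation, drop suffix/stop-word tokens, collapse spaces."""
    text = name.lower().translate(_PUNCT_TO_SPACE)
    tokens = [tok for tok in text.split() if tok not in DROP]
    return " ".join(tokens)


def skeleton(name: str) -> str:
    """Crude spelling normalisation: collapse repeated letters, then drop vowels."""
    text = normalise(name)
    text = re.sub(r"(.)\1+", r"\1", text)  # "allen" -> "alen"
    return re.sub(r"[aeiou]", "", text)    # "alen" -> "ln"
```

Applying `normalise` to both files before scoring keeps the fuzzy matcher focused on genuine spelling differences. `skeleton` is much more aggressive and will merge some distinct names, so it is probably better used as an extra signal than as the only key.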

Optimising the Similarity Threshold

Choosing a good threshold for string similarity matters. If the goal is to match new/unseen data in the future, then simply adjusting the threshold until the current matches look best risks serious overfitting, and the threshold may not generalise to the unseen data. Cross-validation (CV) can help determine this threshold by balancing precision and recall. With annotated name pairs, CV can be used to maximise the F1 score, which is defined as:

$$ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$

This score provides a balanced measure of performance, but the emphasis on precision vs. recall may vary depending on your application's priorities. The "cost" of a false positive may be hugely different from the cost of a false negative (e.g., think about cancer screening in oncology versus communicable disease screening in public health). However, CV requires a labelled dataset with both positive and negative matches, which may not be feasible for smaller datasets.

If there is no expectation of using the model on future/new data, then CV is probably not needed.
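
A minimal sketch of the threshold search, assuming you have hand-labelled a list of `(name_a, name_b, is_match)` tuples. To keep it dependency-free, `difflib.SequenceMatcher` from the standard library stands in for a TheFuzz scorer; the function names, the candidate thresholds and the simple interleaved fold split are all illustrative choices, not a prescribed method.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 100]; a stand-in for a TheFuzz-style ratio."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def f1_at(threshold, pairs):
    """F1 when every pair scoring >= threshold is predicted to be a match."""
    tp = fp = fn = 0
    for a, b, is_match in pairs:
        predicted = similarity(a, b) >= threshold
        if predicted and is_match:
            tp += 1
        elif predicted:
            fp += 1
        elif is_match:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(pairs, candidates=range(50, 100, 5)):
    """Candidate threshold with the highest F1 on the given pairs."""
    return max(candidates, key=lambda t: f1_at(t, pairs))


def cross_validated_f1(pairs, k=3, candidates=range(50, 100, 5)):
    """Choose the threshold on k-1 folds, score it on the held-out fold, average."""
    scores = []
    for i in range(k):
        held_out = pairs[i::k]
        training = [p for j, p in enumerate(pairs) if j % k != i]
        chosen = best_threshold(training, candidates)
        scores.append(f1_at(chosen, held_out))
    return sum(scores) / k


# Hypothetical labelled pairs, for illustration only.
labelled = [
    ("Jon Smith", "John Smith", True),
    ("Acme Ltd", "ACME Limited", True),
    ("Jon Smith", "Jane Smythe", False),
    ("Acme Ltd", "Apex Ltd", False),
]
```

If the cross-validated F1 is much lower than the F1 obtained by tuning on all pairs at once, that is a sign the threshold is overfitted. When false positives and false negatives have different costs, the same loop can maximise a weighted measure instead of F1.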

Alternative Approaches and Tools

In addition to TheFuzz, consider the following:

  • Probabilistic String Matching: This is similar to fuzzy matching and can be considered an extension of it, incorporating multiple features (e.g., names, dates, addresses) to compute an overall match probability. It is not applicable in your case unless you have other data about the names.
  • Cosine Similarity: If the names consist of multiple words, cosine similarity (where the names are vectorised and then the angle between them is computed) might work well.
  • Libraries: rapidfuzz offers faster, optimised string matching with similar functionality to TheFuzz.
  • Large Language Model (LLM)-based matching: I have never tried this, but I think it could be very useful and is certainly worth trying out. I imagine LLMs can enhance name matching by learning contextual similarities.
  • Hybrid approaches: Combining rule-based methods outlined above with LLMs could improve both precision and recall.
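
To make the cosine similarity option concrete, here is a small sketch that vectorises each name as counts of character trigrams. That is one possible vectorisation and an assumption on my part; word-level or TF-IDF vectors are equally valid. The names `ngrams`, `cosine` and `best_match`, and the default threshold of 0.5, are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams, padding with spaces so word edges are captured."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: str, b: str, n: int = 3) -> float:
    """Cosine of the angle between the n-gram count vectors of a and b."""
    va, vb = ngrams(a, n), ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def best_match(name: str, candidates, threshold: float = 0.5):
    """Return the most similar candidate, or None if nothing reaches the threshold."""
    score, match = max(((cosine(name, c), c) for c in candidates),
                       default=(0.0, None))
    return match if score >= threshold else None
```

For each name in File 2 you could call `best_match` against the names in File 1 and copy the amount across only when a match is returned, leaving the rest for manual review.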
$\endgroup$
