Trying to fix names in my database with fuzzywuzzy


Asked 1 year, 2 months ago

Modified 1 year, 2 months ago

Viewed 78 times

I'm trying to find and correct near-duplicate names in my database, such as 'Patrick Maxwell' and 'Patrick Maxwel'. The problem is that the best match returned for each name is usually the name itself, because each name is also in the list it is compared against. As a result, a misspelled variant like 'Patrick Maxwel' is never mapped to another spelling, and I can't consolidate the names into a single correct version.

from fuzzywuzzy import fuzz, process

# Entries containing any of these characters are treated as invalid and left unchanged
INVALID_CHARS = ['/', '\\', '[', ']', '~']

def create_corrected_dict(names_list, threshold=90):
    # Keep only non-empty names without invalid characters, stripped of surrounding whitespace
    filtered_names = [name.strip() for name in names_list if name.strip() and not any(char in name for char in INVALID_CHARS)]

    # Pre-calculate the fuzzy matches
    fuzzy_matches = {}
    for name in filtered_names:
        # extractOne returns (best_choice, score). Because `name` is itself in
        # filtered_names, the best choice is usually `name` with a score of 100.
        match = process.extractOne(name, filtered_names, scorer=fuzz.token_sort_ratio)
        if match and match[1] > threshold and match[0] != name:
            fuzzy_matches[name] = match[0]

    # Build the final mapping from each original name to its corrected form
    corrected_dict = {}
    for name in names_list:
        cleaned_name = name.strip()
        if not cleaned_name or any(char in cleaned_name for char in INVALID_CHARS):
            corrected_dict[name] = name
        elif cleaned_name in fuzzy_matches:
            corrected_dict[name] = fuzzy_matches[cleaned_name]
        else:
            corrected_dict[name] = cleaned_name
    return corrected_dict

# Create the correction dictionary once, from the unique names
unique_names = df_resultante['names'].unique()
dicionario_corrigido = create_corrected_dict(unique_names)

# Apply the correction to the same column the dictionary was built from
df_resultante['Colaborador'] = df_resultante['names'].map(dicionario_corrigido)
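
One way to see why every name matches itself is to compare a search over the full list with a search over the other names only. The sketch below is a minimal illustration, not the asker's code: it uses `difflib.SequenceMatcher` from the standard library as a stand-in scorer so it runs without installing anything, and `similarity`, `best_match`, and the sample names are hypothetical. Scores from `fuzz.token_sort_ratio` may differ.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """0-100 similarity score; a stand-in for a fuzzywuzzy scorer (assumption)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(name, choices):
    """Return (choice, score) for the highest-scoring choice, or None if empty."""
    scored = [(choice, similarity(name, choice)) for choice in choices]
    return max(scored, key=lambda pair: pair[1], default=None)


names = ["Patrick Maxwell", "Patrick Maxwel", "Ana Souza"]

# Searching the full list: every name finds itself with a perfect score.
self_matches = {name: best_match(name, names)[0] for name in names}

# Searching only the *other* names: near-duplicates now find each other.
other_matches = {
    name: best_match(name, [other for other in names if other != name])[0]
    for name in names
}
```

With fuzzywuzzy, the same idea would be to pass `process.extractOne` a list of choices that leaves out the name being looked up. A score threshold is still needed to reject weak matches, and because the two spellings then point at each other, a rule for which one wins is still required.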

edited Sep 3, 2024 at 16:00

user4136999

asked Sep 3, 2024 at 12:48

Kauan Randall Oliveira Ferreir


This type of question would benefit from a list of several different examples of the names you are trying to match, so that we can test solutions and provide an answer that handles the different variations. In any case, how do you plan on knowing which version is correct?

– Chris
Commented Sep 3, 2024 at 14:35

@Chris I can give you an example, but I was thinking of electing the first occurrence as the correct version.

– Kauan Randall Oliveira Ferreir
Commented Sep 4, 2024 at 15:25

Here is an example of the data I was working on: docs.google.com/spreadsheets/d/… @Chris

– Kauan Randall Oliveira Ferreir
Commented Sep 4, 2024 at 15:34
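
Following the idea in the comments of treating the first occurrence as the correct version, one possible approach is to keep a running list of accepted spellings and compare each new name only against that list. This is a sketch under assumptions: `consolidate_to_first` and `similarity` are hypothetical helpers, the scorer is `difflib.SequenceMatcher` from the standard library rather than fuzzywuzzy, and the threshold of 90 is carried over from the question.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """0-100 similarity score; a stand-in for a fuzzywuzzy scorer (assumption)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def consolidate_to_first(names, threshold=90):
    """Map each name to the earliest-seen spelling it closely resembles."""
    canonical = []  # accepted spellings, in order of first appearance
    mapping = {}
    for raw in names:
        if raw in mapping:
            continue  # exact duplicate of a name already handled
        name = raw.strip()
        if not name:
            mapping[raw] = raw  # leave blank entries untouched
            continue
        # Best candidate among the spellings accepted so far (None if there are none)
        best = max(canonical, key=lambda c: similarity(name, c), default=None)
        if best is not None and similarity(name, best) >= threshold:
            mapping[raw] = best
        else:
            canonical.append(name)  # first occurrence becomes the reference spelling
            mapping[raw] = name
    return mapping


names = ["Patrick Maxwell", "Patrick Maxwel", " Patrick Maxwell ", "Ana Souza"]
mapping = consolidate_to_first(names)
```

The resulting dictionary could then be applied with something like `df_resultante['names'].map(mapping)`. Since a name is never compared against itself, the self-match problem does not arise. In the worst case each name is compared against every accepted spelling, so the cost can grow quadratically with the number of distinct names.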
