I'm trying to find and correct similar names in my database, such as 'Patrick Maxwell' and 'Patrick Maxwel'. The problem is that the best match for each name is usually the name itself: every name, including a misspelled one like 'Patrick Maxwel', is also in the list of candidates, so it matches itself with a perfect score. This doesn't help me consolidate the variants into a single correct version.

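# Assumed import (not shown in the original post); fuzzywuzzy and rapidfuzz expose the same names
from thefuzz import fuzz, process

def create_corrected_dict(names_list, threshold=90):
    # Keep only non-empty names that contain none of the unwanted characters
    filtered_names = [name for name in names_list if name.strip() and not any(char in name for char in ['/','\\','[',']','~'])]

    # Pre-calculate the fuzzy matches
    # Problem: `name` is itself one of the choices, so extractOne usually returns it
    # with a score of 100, and the `match[0] != name` check then rejects the match
    fuzzy_matches = {}
    for name in filtered_names:
        match = process.extractOne(name, filtered_names, scorer=fuzz.token_sort_ratio)
        if match and match[1] > threshold and match[0] != name:
            fuzzy_matches[name] = match[0]

    # Map every original name to its match, or to its stripped form if none was found
    corrected_dict = {}
    for name in names_list:
        cleaned_name = name.strip()
        if not cleaned_name or any(char in cleaned_name for char in ['/','\\','[',']','~']):
            # Empty or invalid names are left untouched
            corrected_dict[name] = name
        elif cleaned_name in fuzzy_matches:
            corrected_dict[name] = fuzzy_matches[cleaned_name]
        else:
            corrected_dict[name] = cleaned_name
    return corrected_dict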

# Create the correction dictionary once, from the unique names
unique_names = df_resultante['names'].unique()
dicionario_corrigido = create_corrected_dict(unique_names)

# Apply the correction by looking up the same column the dictionary was built from
df_resultante['Colaborador'] = df_resultante['names'].map(dicionario_corrigido)
  • This type of question would benefit from a list containing several different examples of the types of names you are trying to match, so that we can test solutions and provide an answer that handles the different variations. In any case, how do you plan on knowing which version is correct? Commented Sep 3, 2024 at 14:35
  • @Chris I can give you an example, but I was thinking of electing the first occurrence to be the correct version. Commented Sep 4, 2024 at 15:25
  • Here is an example of the data that I was working on: docs.google.com/spreadsheets/d/… @Chris Commented Sep 4, 2024 at 15:34
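
One possible way to combine the two points raised above (a name should not be compared against itself, and the first occurrence is treated as the correct spelling) is sketched below. This is only an illustration under stated assumptions, not a tested solution for the linked data: it swaps in `difflib.SequenceMatcher` from the standard library in place of the fuzzy-matching scorer used in the question, and the names `similarity` and `build_canonical_map` are hypothetical helpers introduced here.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (stdlib stand-in for a fuzzy ratio)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_canonical_map(names, threshold=90):
    """Map each name to the first earlier name that is similar enough.

    Hypothetical helper: a name is only compared against names that were
    already accepted as canonical, so it can never match itself.
    """
    canonical = []  # accepted spellings, in order of first appearance
    mapping = {}
    for name in names:
        if name in mapping:
            continue  # already handled this exact string
        cleaned = name.strip()
        if not cleaned:
            mapping[name] = name  # leave empty names untouched
            continue
        best, best_score = None, threshold
        for candidate in canonical:
            score = similarity(cleaned, candidate)
            if score > best_score:
                best, best_score = candidate, score
        if best is None:
            # Nothing similar seen so far: this spelling becomes canonical
            canonical.append(cleaned)
            best = cleaned
        mapping[name] = best
    return mapping


names = ["Patrick Maxwell", "Patrick Maxwel", "Maria Silva"]
mapping = build_canonical_map(names)
print(mapping["Patrick Maxwel"])
```

With a DataFrame, the resulting dictionary could be passed to `Series.map` in the same way as in the question. Because the first occurrence wins, the order of the input decides which spelling is kept, so a misspelling that appears first would become the canonical version.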
