
I'm working with many tabular datasets (Excel, CSV) that contain inconsistent or messy column names due to typos, different naming conventions, spacing, punctuation, etc.

I have a standard schema (as a dict) that defines the "official" names I want to use in all my cleaned datasets.

What I want is to automatically rename columns in incoming data to match this schema as accurately as possible, ideally without having to maintain a huge list of all possible variations.

Example:

import pandas as pd

df = pd.DataFrame(columns=[
    'usssrname', 
    'e-mail*ze', 
    'Phoonnenumebr', 
    'adrress3', 
    'joined_date'
])

Standard mapping:

standard_mapping = {
    'username': 'userName',
    'email': 'emailAddress',
    'phone number': 'phoneNumber',
    'address': 'address',
    'join date': 'registrationDate'
}
Desired result (automatically):
df.columns == [
    'userName', 
    'emailAddress', 
    'phoneNumber', 
    'address', 
    'registrationDate'
]

Requirements:

✅ Automatically detect the best matches between raw column names and standard keys
✅ Rename the columns in the DataFrame
✅ Log or flag uncertain matches (optional but useful)
❌ I don’t want to hardcode every possible typo or variation

Notes: I'm using pandas, but open to external Python libraries if needed

I’m aiming for something reusable and scalable across multiple files/schemas

I don’t mind fuzzy matching, vector similarity, or rule-based logic — as long as it's automatic and robust

Comments:

  • You've already got an idea of some things that may work (fuzzy matching, vector similarity, etc.), so please share what methods you have already tried and why they didn't work for your desired outcome. SO isn't a free code-writing service; we require you to show some effort yourself. (Commented Jun 25 at 15:13)

2 Answers


You could use difflib from the standard library, or any other matching library.

There are plenty of similarity-matching libraries you may want to dig into to handle edge cases, get similarity scores, etc.

import pandas as pd
import difflib
df = pd.DataFrame(columns=[
    'usssrname',
    'e-mail*ze',
    'Phoonnenumebr',
    'adrress3',
    'joined_date'
])
standard_mapping = {
    'username': 'userName',
    'email': 'emailAddress',
    'phone number': 'phoneNumber',
    'address': 'address',
    'join date': 'registrationDate'
}

# For each raw column, find its closest schema key, then look up the
# standard name. get_close_matches returns an empty list when nothing
# clears its cutoff (0.6 by default), so guard before indexing with [0].
df.columns = [
    standard_mapping[m[0]] if (m := difflib.get_close_matches(c, standard_mapping, n=1)) else c
    for c in df.columns
]
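The question also asks to log or flag uncertain matches. One way to do that with difflib is to keep any column that finds no close match and report it. The sketch below assumes the df and standard_mapping defined above; the rename_with_report helper and its cutoff default are my own naming and choices, not part of difflib:

import difflib

def rename_with_report(df, mapping, cutoff=0.6):
    """Rename columns to the standard names of their closest schema keys;
    also return the columns that found no close match."""
    renamed, unmatched = {}, []
    for col in df.columns:
        hits = difflib.get_close_matches(col, mapping, n=1, cutoff=cutoff)
        if hits:
            renamed[col] = mapping[hits[0]]
        else:
            unmatched.append(col)  # left unchanged; flag for manual review
    return df.rename(columns=renamed), unmatched

df, flagged = rename_with_report(df, standard_mapping)
print(flagged)  # any names listed here need a human to look at them

Returning the unmatched names instead of raising keeps this usable in a batch pipeline over many files: you can collect the flags per file and review them afterwards.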

You could use thefuzz and its process.extractOne method, optionally defining a threshold (by default, score_cutoff=0):

from thefuzz.process import extractOne

df.columns = [
    # extractOne returns a (match, score) tuple when the score clears
    # score_cutoff, otherwise None; fall back to the original name then
    standard_mapping.get(x[0])
    if (x := extractOne(c, list(standard_mapping), score_cutoff=70))
    else c
    for c in df
]

Updated columns:

['userName', 'emailAddress', 'phoneNumber', 'address', 'registrationDate']
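If you also want to log uncertain matches, extractOne returns the score alongside the match, so you can print anything that falls between an acceptance threshold and a "confident" one. A minimal sketch, assuming the df and standard_mapping from the question; the build_rename_map helper and the 70/90 thresholds are illustrative choices, not part of thefuzz:

from thefuzz.process import extractOne

def build_rename_map(columns, mapping, accept=70, confident=90):
    """Map raw column names to standard names, printing anything uncertain."""
    rename_map = {}
    for col in columns:
        best = extractOne(col, list(mapping), score_cutoff=accept)
        if best is None:
            print(f"no match for {col!r}; leaving it unchanged")
            continue
        key, score = best
        if score < confident:
            print(f"uncertain: {col!r} -> {mapping[key]!r} (score {score})")
        rename_map[col] = mapping[key]
    return rename_map

df = df.rename(columns=build_rename_map(df.columns, standard_mapping))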
