
I'm working with many tabular datasets (Excel, CSV) that contain inconsistent or messy column names due to typos, different naming conventions, spacing, punctuation, etc.

I have a standard schema (as a dict) that defines the "official" names I want to use in all my cleaned datasets.

What I want is to automatically rename columns in incoming data to match this schema as accurately as possible, ideally without having to maintain a huge list of all possible variations.

Example:

import pandas as pd

df = pd.DataFrame(columns=[
    'usssrname', 
    'e-mail*ze', 
    'Phoonnenumebr', 
    'adrress3', 
    'joined_date'
])

Standard mapping:

standard_mapping = {
    'username': 'userName',
    'email': 'emailAddress',
    'phone number': 'phoneNumber',
    'address': 'address',
    'join date': 'registrationDate'
}
Desired result (automatically):
df.columns == [
    'userName', 
    'emailAddress', 
    'phoneNumber', 
    'address', 
    'registrationDate'
]

Requirements:

✅ Automatically detect the best matches between raw column names and standard keys
✅ Rename the columns in the DataFrame
✅ Log or flag uncertain matches (optional but useful)
❌ I don’t want to hardcode every possible typo or variation

Notes: I'm using pandas, but open to external Python libraries if needed

I’m aiming for something reusable and scalable across multiple files/schemas

I don’t mind fuzzy matching, vector similarity, or rule-based logic — as long as it's automatic and robust

Comments:

  • You've already got an idea of some things that may work (fuzzy matching, vector similarity, etc.), so please share what methods you have already tried and why they didn't work for your desired outcome. SO isn't a free code-writing service; we require you to show some effort yourself. (Commented Jun 25 at 15:13)

2 Answers


You could use difflib from the standard library, or any other matching library.

There are plenty of similarity-matching libraries you may want to dig into to handle edge cases, get similarity scores, etc.

import pandas as pd
import difflib
df = pd.DataFrame(columns=[
    'usssrname',
    'e-mail*ze',
    'Phoonnenumebr',
    'adrress3',
    'joined_date'
])
standard_mapping = {
    'username': 'userName',
    'email': 'emailAddress',
    'phone number': 'phoneNumber',
    'address': 'address',
    'join date': 'registrationDate'
}

# For each raw column, find its closest schema key, then look up the
# standard name. get_close_matches returns an empty list when nothing
# clears its cutoff (0.6 by default), so guard before indexing with [0].
df.columns = [
    standard_mapping[m[0]] if (m := difflib.get_close_matches(c, standard_mapping, n=1)) else c
    for c in df.columns
]
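The question also asks to log or flag uncertain matches. One way to do that with difflib is to keep any column that finds no close match and report it. The sketch below assumes the df and standard_mapping defined above; the rename_with_report helper and its cutoff default are my own naming and choices, not part of difflib:

import difflib

def rename_with_report(df, mapping, cutoff=0.6):
    """Rename columns to the standard names of their closest schema keys;
    also return the columns that found no close match."""
    renamed, unmatched = {}, []
    for col in df.columns:
        hits = difflib.get_close_matches(col, mapping, n=1, cutoff=cutoff)
        if hits:
            renamed[col] = mapping[hits[0]]
        else:
            unmatched.append(col)  # left unchanged; flag for manual review
    return df.rename(columns=renamed), unmatched

df, flagged = rename_with_report(df, standard_mapping)
print(flagged)  # any names listed here need a human to look at them

Returning the unmatched names instead of raising keeps this usable in a batch pipeline over many files: you can collect the flags per file and review them afterwards.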

You could use thefuzz and its process.extractOne method, optionally defining a threshold (by default, score_cutoff=0):

from thefuzz.process import extractOne

df.columns = [
    # extractOne returns a (match, score) tuple when the score clears
    # score_cutoff, otherwise None; fall back to the original name then
    standard_mapping.get(x[0])
    if (x := extractOne(c, list(standard_mapping), score_cutoff=70))
    else c
    for c in df
]

Updated columns:

['userName', 'emailAddress', 'phoneNumber', 'address', 'registrationDate']
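If you also want to log uncertain matches, extractOne returns the score alongside the match, so you can print anything that falls between an acceptance threshold and a "confident" one. A minimal sketch, assuming the df and standard_mapping from the question; the build_rename_map helper and the 70/90 thresholds are illustrative choices, not part of thefuzz:

from thefuzz.process import extractOne

def build_rename_map(columns, mapping, accept=70, confident=90):
    """Map raw column names to standard names, printing anything uncertain."""
    rename_map = {}
    for col in columns:
        best = extractOne(col, list(mapping), score_cutoff=accept)
        if best is None:
            print(f"no match for {col!r}; leaving it unchanged")
            continue
        key, score = best
        if score < confident:
            print(f"uncertain: {col!r} -> {mapping[key]!r} (score {score})")
        rename_map[col] = mapping[key]
    return rename_map

df = df.rename(columns=build_rename_map(df.columns, standard_mapping))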
