I'm working with many tabular datasets (Excel, CSV) that contain inconsistent or messy column names due to typos, different naming conventions, spacing, punctuation, etc.
I have a standard schema (as a dict) that defines the "official" names I want to use in all my cleaned datasets.
What I want is to automatically rename columns in incoming data to match this schema as accurately as possible, ideally without having to maintain a huge list of all possible variations.
Example:
```python
import pandas as pd

df = pd.DataFrame(columns=[
    'usssrname',
    'e-mail*ze',
    'Phoonnenumebr',
    'adrress3',
    'joined_date',
])
```
Standard mapping:
```python
standard_mapping = {
    'username': 'userName',
    'email': 'emailAddress',
    'phone number': 'phoneNumber',
    'address': 'address',
    'join date': 'registrationDate',
}
```
Desired result (automatically):
```python
df.columns == [
    'userName',
    'emailAddress',
    'phoneNumber',
    'address',
    'registrationDate',
]
```
Requirements:

- ✅ Automatically detect the best matches between raw column names and standard keys
- ✅ Rename the columns in the DataFrame
- ✅ Log or flag uncertain matches (optional but useful)
- ❌ I don't want to hardcode every possible typo or variation
Notes:

- I'm using pandas, but I'm open to external Python libraries if needed
- I'm aiming for something reusable and scalable across multiple files/schemas
- I don't mind fuzzy matching, vector similarity, or rule-based logic, as long as it's automatic and robust
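To make the question concrete, here is a minimal sketch of the kind of approach I'm imagining, using only the stdlib `difflib` (the `normalize` helper, the `cutoff` threshold, and the `rename_columns` function name are my own placeholders, not from any library):

```python
import difflib
import pandas as pd

def normalize(name: str) -> str:
    # Lowercase and strip anything that isn't a letter or digit, so
    # 'e-mail*ze' and 'Phoonnenumebr' compare on their core characters.
    return ''.join(ch for ch in name.lower() if ch.isalnum())

def rename_columns(df: pd.DataFrame, mapping: dict, cutoff: float = 0.6) -> pd.DataFrame:
    # Map normalized schema keys back to their original spelling.
    normalized_keys = {normalize(k): k for k in mapping}
    renames = {}
    for col in df.columns:
        # Find the closest schema key above the similarity cutoff.
        matches = difflib.get_close_matches(
            normalize(col), normalized_keys, n=1, cutoff=cutoff
        )
        if matches:
            renames[col] = mapping[normalized_keys[matches[0]]]
        else:
            # Flag columns where no match clears the threshold.
            print(f'No confident match for column: {col!r}')
    return df.rename(columns=renames)
```

With the example data above, `rename_columns(df, standard_mapping)` gets every column right, but I suspect `difflib`'s ratio will break down on harder cases, which is why I'm asking whether a dedicated library or embedding-based similarity would be more robust.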