Replacing characters in entire Pandas dataframe with values from a dictionary

Question

I have a German CSV file that was incorrectly encoded. I want to convert the characters back to UTF-8 using a dictionary. I thought what I was doing was correct, but when I print the DataFrame, nothing has changed. Here's my code:

import os

import pandas as pd

DATA_DIR = 'C:\\...'

translations = {
    'Ã¶': 'oe',
    'Ã¼': 'ue',
    'ÃŸ': 'ss',
    'Ã¤': 'ae',
    'â‚¬': '€',
    'Ã„': 'Ae',
    'Ã–': 'Oe',
    'Ãœ': 'Ue'
}


def cleanup():
    for file in os.listdir(os.path.join(DATA_DIR)):
        if not file.lower().endswith('.csv'):
            continue

        data_utf = pd.read_csv(os.path.join(DATA_DIR, file), header=3, index_col=None, skiprows=0-2)

        data_utf.replace(translations, inplace=True)

        print(data_utf)

if __name__ == '__main__':
    cleanup()

I also tried

        for before, after in translations.items():
            data_utf.replace(before, after)

within the function, and I tried passing the translations directly to replace itself. The replacement does work, however, if I specify the column in which to replace the characters. What do I need to do to apply these translations to the whole dataframe, as well as to the dataframe column headers? Thanks!

Reopened because the question is more involved: the column names need replacing as well.
– jezrael, Commented Oct 31, 2019 at 12:00

jezrael · Accepted Answer · 2019-10-31 11:45:19Z

Add regex=True so that replace matches substrings instead of whole cell values. For the column names, convert the Index to a Series with Index.to_series and then call replace on it:

import pandas as pd

data_utf = pd.DataFrame({'raÃœing': ['Ã¶saÃ¼s', 'Ã„ dd Ã–', 'Ã–Ã„']})

translations = {
    'Ã¶': 'oe',
    'Ã¼': 'ue',
    'ÃŸ': 'ss',
    'Ã¤': 'ae',
    'â‚¬': '€',
    'Ã„': 'Ae',
    'Ã–': 'Oe',
    'Ãœ': 'Ue'
}

data_utf.replace(translations, inplace=True, regex=True)
data_utf.columns = data_utf.columns.to_series().replace(translations, regex=True)
print(data_utf)
    raUeing
0   oesaues
1  Ae dd Oe
2      OeAe
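
The two replace calls above can be wrapped in a small helper so the same dictionary is applied to the cell values and to the column names in one step. This is only a sketch: the function name apply_translations is hypothetical, and the re.escape step is an added precaution (not part of the original answer) because with regex=True pandas treats every dictionary key as a regular expression pattern. It also assumes the column names are strings.

```python
import re

import pandas as pd


def apply_translations(df, translations):
    """Return a copy of df with each key of `translations` replaced by its
    value, both inside string cells and inside the column names."""
    # Escape the keys so that characters such as '.' or '$' are matched
    # literally; with regex=True each key is interpreted as a pattern.
    patterns = {re.escape(before): after for before, after in translations.items()}

    # replace() without inplace=True returns a new DataFrame.
    cleaned = df.replace(patterns, regex=True)
    # Column labels are an Index; go through a Series to reuse replace().
    cleaned.columns = cleaned.columns.to_series().replace(patterns, regex=True)
    return cleaned


translations = {'Ã¶': 'oe', 'Ã¼': 'ue', 'Ã„': 'Ae', 'Ã–': 'Oe', 'Ãœ': 'Ue'}
data_utf = pd.DataFrame({'raÃœing': ['Ã¶saÃ¼s', 'Ã„ dd Ã–', 'Ã–Ã„']})
cleaned = apply_translations(data_utf, translations)
print(cleaned)
```

Because the helper does not use inplace=True, the original data_utf is left unchanged and the cleaned copy is returned.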

edited Oct 31, 2019 at 11:45

answered Oct 31, 2019 at 11:41

jezrael


5 Comments

trashdragon Over a year ago

Great, that worked! I knew it had to be simple. However, it doesn't change the column headers. Is there anything special I have to do to get the translation to cover the headers?

jezrael Over a year ago

@trashdragon - Can you add to the question that the column names also need to be replaced?

trashdragon Over a year ago

The to_series() part was giving me an 'unresolved attribute reference', so to get it to work I had to use data_utf.rename(columns={'ArtikelÂbezeichnung': 'Artikel bezeichnung'}), though this is unfortunately hard-coded.

jezrael Over a year ago

@trashdragon - What is your pandas version? Tested in pandas 0.25.0

trashdragon Over a year ago

My version is 0.25.2. Perhaps it's a PyCharm issue: the code works when I run it, but the IDE still marks to_series() as unresolved. Either way it works, so thank you :)
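
The hard-coded rename mentioned in the comments could probably be generalized by passing a function to rename instead of a fixed mapping, which also avoids the to_series() call that the IDE flagged. The following is a rough sketch rather than a tested fix for the original file: translate_text and header_translations are hypothetical names, the sample column names are made up, and mapping the stray 'Â' to a space is an assumption taken from the rename in the comment.

```python
import pandas as pd


def translate_text(text, translations):
    """Replace every occurrence of each key of `translations` in `text`."""
    for before, after in translations.items():
        text = text.replace(before, after)
    return text


# Hypothetical header mapping and sample data; column names are assumed
# to be strings, otherwise str.replace is not available on them.
header_translations = {'Â': ' ', 'Ãœ': 'Ue'}
data_utf = pd.DataFrame({'ArtikelÂbezeichnung': ['a'], 'Ãœbersicht': ['b']})

# rename() calls the function once per column label and returns a new frame.
renamed = data_utf.rename(columns=lambda name: translate_text(name, header_translations))
print(list(renamed.columns))
```

This uses plain str.replace, so the dictionary keys are matched literally and need no regex escaping.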
