I have a German CSV file that was incorrectly encoded. I want to convert the characters back to UTF-8 using a dictionary. I thought my approach was correct, but when I print the DataFrame, nothing has changed. Here's my code:

import os

import pandas as pd

DATA_DIR = 'C:\\...'

translations = {
    'ö': 'oe',
    'ü': 'ue',
    'ß': 'ss',
    'ä': 'ae',
    '€': '€',
    'Ä': 'Ae',
    'Ö': 'Oe',
    'Ü': 'Ue'
}


def cleanup():
    for file in os.listdir(os.path.join(DATA_DIR)):
        if not file.lower().endswith('.csv'):
            continue

        data_utf = pd.read_csv(os.path.join(DATA_DIR, file), header=3, index_col=None, skiprows=0-2)

        data_utf.replace(translations, inplace=True)

        print(data_utf)

if __name__ == '__main__':
    cleanup()

I also tried

        for before, after in translations.items():
            data_utf.replace(before, after)

within the function, and directly putting the translations in the replace itself. This process works if I specify the column in which to replace the characters, however. What do I need to do to apply these translations to the whole dataframe, as well as to the dataframe column headers? Thanks!

  • Reopened, because this is more complicated: the column names need to be replaced too. Commented Oct 31, 2019 at 12:00

1 Answer

Add regex=True so that replace matches substrings instead of whole cell values. For the column names, convert the Index to a Series with Index.to_series and then use replace in the same way:

import pandas as pd

data_utf = pd.DataFrame({'raÜing':['ösaüs','Ä dd Ö','ÖÄ']})

translations = {
    'ö': 'oe',
    'ü': 'ue',
    'ß': 'ss',
    'ä': 'ae',
    '€': '€',
    'Ä': 'Ae',
    'Ö': 'Oe',
    'Ü': 'Ue'
}

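# regex=True makes replace match the keys inside longer strings
data_utf.replace(translations, inplace=True, regex=True)
# apply the same mapping to the column names by going through a Series
data_utf.columns = data_utf.columns.to_series().replace(translations, regex=True)
print(data_utf)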
    raUeing
0   oesaues
1  Ae dd Oe
2      OeAe
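
To see why the original call changed nothing, it may help to compare the two modes side by side. This is only a minimal sketch on made-up sample data (the `name` column and its values are assumptions, not taken from the CSV in the question): without regex=True, a dict key has to equal the entire cell value before it is replaced.

```python
import pandas as pd

# Hypothetical sample data, not from the original CSV.
df = pd.DataFrame({"name": ["Müller", "ü"]})
mapping = {"ü": "ue"}

# Default (regex=False): a key must equal the whole cell value to match.
whole_cell = df.replace(mapping)

# regex=True: keys are treated as patterns and matched inside strings.
substring = df.replace(mapping, regex=True)

print(whole_cell["name"].tolist())  # ['Müller', 'ue']
print(substring["name"].tolist())   # ['Mueller', 'ue']
```

Because the keys are interpreted as regular expressions, a key containing a metacharacter such as `.` or `$` would probably need to be passed through `re.escape` first. None of the keys in the translations dict above contain one.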

5 Comments

Great, that worked! I knew it had to be simple. However, it doesn't change the column headers. Is there anything special I have to do to get the translation to cover the headers?
@trashdragon - Can you add to the question that the column names also need to be replaced?
The to_series() part was giving me an 'unresolved attribute reference', so to get it to work I had to use data_utf.rename(columns={'Artikel­bezeichnung': 'Artikel bezeichnung'}), though this is unfortunately hard-coded.
@trashdragon - What is your pandas version? Tested in pandas 0.25.0
My version is 0.25.2. Perhaps it's an error in PyCharm, because the code works when I run it, but the attribute is still marked as unresolved. Either way it works, so thank you :)
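
For anyone who wants to avoid both the hard-coded rename and the to_series() call, one possible alternative is to pass a function to DataFrame.rename, which applies it to every column label. The sketch below is an assumption-laden illustration rather than the answer's method: the helper name `transliterate` and the sample column names are hypothetical.

```python
import pandas as pd

translations = {
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}


def transliterate(label):
    """Apply every entry of `translations` to a single column label."""
    text = str(label)
    for before, after in translations.items():
        text = text.replace(before, after)  # plain substring replace, no regex
    return text


# Hypothetical column names, used only for illustration.
df = pd.DataFrame({"Größe": [1], "Länge": [2]})
df = df.rename(columns=transliterate)

print(list(df.columns))  # ['Groesse', 'Laenge']
```

Since this uses plain str.replace on each label, no regex escaping is involved, and the same dict can still be passed to DataFrame.replace for the cell values.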
