Transform ascii to unicode

Question

I am trying to convert

'Schutzt\xc3\xbcren'.encode("utf-8")

the following to unicode, but I get the error

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)

I would like to get

'Schutztüren'

as a result.

Oops, 'Schutzt\xc3\xbcren' is not ASCII! ASCII codes must be in the range 0-127. It is the utf-8 encoded byte string for 'Schutztüren'. Along the same lines, you encode a unicode string to a byte string with an encoding, and decode a byte string to a unicode string.
– Serge Ballesta, Commented May 12, 2017 at 14:11
@SergeBallesta: I have tuples that contain these strings, and these tuples are cells in a DataFrame that I save to disk with .to_csv. I get these ugly strings. How do I get them on disk in a nice format?
– Make42, Commented May 12, 2017 at 14:13
@Make42, switch to Python 3, then add encoding="utf8" both to df.to_csv() and when you read your file. Everything will Just Work.
– alexis, Commented May 12, 2017 at 14:15

alexis · Accepted Answer · 2017-05-14 10:55:45Z

5

Your string is already in utf-8. You need to decode it to Unicode in order to use it inside Python:

print 'Schutzt\xc3\xbcren'.decode("utf-8")

But you have a bigger problem: you are clearly using Python 2. Switch to Python 3 immediately; there is no reason to drive yourself crazy trying to understand the Python 2 approach to handling character encodings. Switch to Python 3 and you will not have to bang your head against your desk several times a day. (Note that although you were calling the encode() method, you got a UnicodeDecodeError.)

A simple explanation:

In Python, unicode and utf-8 are different things. A str in Python 2 might be in the "utf-8" encoding; unicode objects have no encoding.
If you try to use a str for something that requires unicode (e.g., to encode() it), or vice versa, Python 2 will try to implicitly convert it first. Except it doesn't know the encoding of your strings, so it guesses (ascii, in your case). Oops.
Python 2 has a lot of implicit conversions.
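
The implicit conversion described above can be reproduced by hand. The sketch below is written for Python 3, where the hidden step has to be spelled out; the variable names are illustrative, and the explicit ascii decode stands in for what Python 2 does behind the scenes when encode() is called on a str.

```python
# The utf-8 bytes from the question (a Python 2 str literal holds the same bytes).
raw = b'Schutzt\xc3\xbcren'

# Python 2 treats str.encode("utf-8") roughly as
# str.decode("ascii").encode("utf-8"). Spelling out that hidden decode
# reproduces the error from the question.
try:
    raw.decode("ascii").encode("utf-8")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# Naming the real encoding in the decode step is the fix.
text = raw.decode("utf-8")

print(message)
print(text == 'Schutzt\u00fcren')  # True
```

This is why a call to encode() can end in a UnicodeDecodeError: the failure happens in the decode step that runs first.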

But really the reason is simple: You are not using Python 3.

Edit: Since Python 3 is not an option, here is some practical advice:

Unicode sandwich: Convert all text to Unicode as soon as it's read in, work with unicode strings and encode back to a utf8 str only to write it out again.
Pandas should still support the encoding argument to to_csv(), even on Python 2. Use it to write your files in utf8.
To read a file directly, use codecs.open() instead of plain open(). It accepts the encoding= argument and will give you unicode strings.
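
A minimal sketch of the unicode sandwich using codecs.open(), which exists in both Python 2 and Python 3. The helper names read_text and write_text, the temporary file names, and the upper() step are assumptions made for illustration; only the decode-on-read, encode-on-write pattern comes from the advice above.

```python
import codecs
import os
import shutil
import tempfile


def read_text(path, encoding="utf-8"):
    # Bottom slice of bread: bytes on disk become unicode text right away.
    with codecs.open(path, "r", encoding=encoding) as handle:
        return handle.read()


def write_text(path, text, encoding="utf-8"):
    # Top slice of bread: unicode text is encoded only when written out.
    with codecs.open(path, "w", encoding=encoding) as handle:
        handle.write(text)


workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "in.txt")    # hypothetical input file
target = os.path.join(workdir, "out.txt")   # hypothetical output file

# Create a utf-8 encoded input file holding the bytes from the question.
with open(source, "wb") as handle:
    handle.write(b"Schutzt\xc3\xbcren\n")

text = read_text(source)    # unicode from here on
shouted = text.upper()      # the filling: work on text, never on raw bytes
write_text(target, shouted)

# Inspect the raw bytes that ended up on disk.
with open(target, "rb") as handle:
    written = handle.read()

shutil.rmtree(workdir)
print(repr(written))
```

Because all processing happens on unicode text, the upper() call turns 'ü' into 'Ü' correctly, which it could not do reliably on the raw bytes.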

edited May 14, 2017 at 10:55

answered May 12, 2017 at 14:10

alexis


4 Comments

Make42 Over a year ago

Sure, I would love to, but... company policy for now.

alexis Over a year ago

Ouch... you're in for a world of hurt. If it's a new project, try the "unicode sandwich" approach (convert everything to unicode as soon as you read it, convert back to str only when writing to files).

bruno desthuilliers Over a year ago

I've been using Python 2 since 1999, still use it daily, and I never feel the need to "bang my head against my desk". The Python 2 "approach to handling character encodings" is nothing complicated, really, unless you don't have a clue about unicode, byte strings and encodings of course, but then Python 3 won't help.

alexis Over a year ago

@bruno, that's because you have been using it since 1999! :-) Of course it works pretty well if you already know it. But wrapping one's head around this now, when Python 3 is available, is useless pain. And Python 3 will help a lot if you don't have a clue. I teach this stuff, so I say so from experience. The Python 2 version isn't worth the trouble to teach to beginners, and by extension to anyone, unless they have no other choice.

hspandher · Accepted Answer · 2017-05-12 14:06:20Z

1

You need to decode the utf-8 encoded string to unicode instead.

'Schutzt\xc3\xbcren'.decode("utf-8")

answered May 12, 2017 at 14:06

hspandher


10 Comments

Ujjaval Moradiya Over a year ago

It worked. Can you explain how you came to this solution?

Make42 Over a year ago

'Schutzt\xc3\xbcren'.decode("utf-8") results in u'Schutzt\xfcren' - so not working for me.

hspandher Over a year ago

To explain in a bit more detail, utf-8 is an encoding for storing unicode characters as bytes. A unicode character is an abstract code point, not a fixed number of bytes. In utf-8, ascii characters are stored as a single byte, while other characters take two to four bytes. Here 'ü' is stored as the two bytes \xc3\xbc, which are combined back into one unicode character when the string is decoded.
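
UTF-8 is a variable-width encoding, which a few encoded lengths make concrete. This is a small sketch in Python 3; the sample characters other than 'ü' are chosen for illustration and do not appear in the question.

```python
# One sample character per utf-8 width: 'a', 'ü', the euro sign, an emoji.
samples = ["a", "\u00fc", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies once encoded as utf-8.
widths = [len(ch.encode("utf-8")) for ch in samples]
print(widths)  # [1, 2, 3, 4]

# 'ü' (U+00FC) becomes exactly the two bytes seen in the question.
encoded = "\u00fc".encode("utf-8")
print(encoded)  # b'\xc3\xbc'
```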

Serge Ballesta Over a year ago

@Make42: it works. '\xfc' is the escaped form of the unicode character 'ü', or in unicode notation U+00FC. What you see is the repr of the string, not its printed form. print 'Schutzt\xc3\xbcren'.decode("utf-8") should give the correct output (if your terminal is correctly configured).
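
The difference between the escaped form and the actual text can be checked without relying on the terminal. A sketch in Python 3, where ascii() is used as a stand-in for the Python 2 repr shown in the comment above:

```python
text = b'Schutzt\xc3\xbcren'.decode("utf-8")

# ascii() escapes non-ASCII characters, much like repr() does in Python 2.
escaped = ascii(text)
print(escaped)  # 'Schutzt\xfcren'

# The escape is only a display form: the string holds 11 characters,
# and the eighth one is the single code point U+00FC.
print(len(text))  # 11
print(hex(ord(text[7])))  # 0xfc
print(text == "Schutzt\u00fcren")  # True
```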

hspandher Over a year ago

@SergeBallesta Could you please let me know how to configure the terminal for this? I have also been facing this issue for quite a while.


hiro protagonist · Accepted Answer · 2017-05-12 14:08:12Z

0

In Python 3 you need to decode the bytes that hold your encoded string:

b'Schutzt\xc3\xbcren'.decode("utf-8")

In Python 2 the b prefix is not necessary (there the distinction between bytes and strings is less strict...).
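
How strict that distinction is in Python 3 can be seen from the types involved. A short sketch, assuming Python 3; the variable names are illustrative:

```python
encoded = b'Schutzt\xc3\xbcren'
decoded = encoded.decode("utf-8")

# bytes and str are separate types in Python 3.
print(type(encoded).__name__, type(decoded).__name__)  # bytes str

# A Python 3 str cannot be decoded again; only bytes have decode().
print(hasattr(decoded, "decode"))  # False

# Encoding the text with the same codec gives back the original bytes.
print(decoded.encode("utf-8") == encoded)  # True
```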

answered May 12, 2017 at 14:08

hiro protagonist
