
I am trying to convert the following to unicode:

'Schutzt\xc3\xbcren'.encode("utf-8")

but it fails with the error

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)

I would like to get

'Schutztüren'

as a result.

9
  • Oops, 'Schutzt\xc3\xbcren' is not ASCII! ASCII codes must be in the range 0-127. It is the UTF-8 encoded byte string for 'Schutztüren'. Along the same lines: you encode a unicode string to a byte string with an encoding, and you decode a byte string to a unicode string. Commented May 12, 2017 at 14:11
  • @SergeBallesta: I have tuples that contain these strings and these tuples are cells in a DataFrame that I save to disk with .to_csv. I get these ugly strings. How do I get them on disk in a nice format? Commented May 12, 2017 at 14:13
  • @Make42, switch to Python 3, then add encoding="utf8" both to df.to_csv() and when you read your file. Everything will Just Work. Commented May 12, 2017 at 14:15
  • @alexis: No can do. Company policy for now. Commented May 12, 2017 at 14:16
  • Saw your other message, I sympathize. See my reply there. Commented May 12, 2017 at 14:18

3 Answers

5

Your string is already in utf-8. You need to decode it to Unicode in order to use it inside Python:

print 'Schutzt\xc3\xbcren'.decode("utf-8")

But you have a bigger problem: You are clearly using Python 2. Switch to Python 3 immediately; there is no reason to drive yourself crazy trying to understand the Python 2 approach to handling character encodings. Switch to Python 3 and you will not have to bang your head against your desk several times a day. (Note that although you were calling the encode() method, you got a UnicodeDecodeError.)

A simple explanation:

  • In Python, unicode and utf-8 are different things. A str in Python 2 might be in the "utf-8" encoding; unicode objects have no encoding.
  • If you try to use a str for something that requires unicode (e.g., to encode() it), or vice versa, Python 2 will try to implicitly convert it first. Except it doesn't know the encoding of your strings, so it guesses (ascii, in your case). Oops.
  • Python2 has a lot of implicit conversions.
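The implicit conversion described above can be made visible by performing the hidden step by hand. The following is a minimal sketch, assuming that Python 2's default codec is 'ascii' (its standard setting); the variable names are only illustrative. It is written so that it also runs on Python 3, where the decode must be spelled out explicitly.

```python
# Sketch: reproduce by hand the ASCII decode that Python 2 performs
# implicitly when encode() is called on a byte string.

raw = b'Schutzt\xc3\xbcren'   # UTF-8 bytes, as in the question

# Hidden step: decode the bytes with the 'ascii' codec.
# Byte 0xc3 at index 7 is outside ASCII, so this raises.
try:
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)

# The fix: name the real encoding when decoding.
text = raw.decode("utf-8")
```

The printed message should match the one in the question, which is why an encode() call ends in a UnicodeDecodeError.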

But really the reason is simple: You are not using Python 3.

Edit: Since Python 3 is not an option, here is some practical advice:

  1. Unicode sandwich: Convert all text to Unicode as soon as it's read in, work with unicode strings and encode back to a utf8 str only to write it out again.

  2. Pandas should still support the encoding argument to to_csv(), even on Python 2. Use it to write your files in utf8.

  3. For reading a file directly, use codecs.open() instead of plain open(). It accepts the encoding= argument and will give you unicode strings.
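The "unicode sandwich" from the list above might look roughly like this with codecs.open(). This is only a sketch: the temporary file names and the upper-casing step are made up for the demonstration, and pandas is left out on purpose.

```python
import codecs
import os
import tempfile

# Hypothetical demo files in a temporary directory.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "in.txt")
dst_path = os.path.join(tmp_dir, "out.txt")

# Prepare an input file that contains UTF-8 bytes.
with open(src_path, "wb") as f:
    f.write(b'Schutzt\xc3\xbcren\n')

# Bottom slice: decode to unicode as soon as the text is read in.
with codecs.open(src_path, "r", encoding="utf-8") as f:
    text = f.read()

# Filling: work with unicode strings only.
shouted = text.strip().upper()

# Top slice: encode back to UTF-8 only when writing out.
with codecs.open(dst_path, "w", encoding="utf-8") as f:
    f.write(shouted)

# Read the raw bytes back to inspect what landed on disk.
with open(dst_path, "rb") as f:
    written = f.read()
```

Because all processing happens on unicode objects, upper() handles the umlaut correctly, and the only places where an encoding is named are the two file boundaries.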


4 Comments

Sure, I would love to, but... company policy for now.
Ouch... you're in for a world of hurt. If it's a new project, try the "unicode sandwich" approach (convert everything to unicode as soon as you read it, convert back to str only when writing to files).
I've been using Python 2 since 1999, still using it daily, and I never feel the need to "bang my head against my desk". The Python 2 "approach to handling character encodings" is nothing complicated, really - unless you don't have a clue about unicode, byte strings and encodings of course, but then Python 3 won't help.
@bruno, that's because you have been using it since 1999! :-) Of course it works pretty well if you already know it. But wrapping one's head around this now, when Python 3 is available, is useless pain. And Python 3 will help a lot if you don't have a clue. I teach this stuff so I say so from experience. The Python 2 version isn't worth the trouble to teach to beginners-- and by extension to anyone, unless they have no other choice.
1

You need to decode the UTF-8 encoded string to unicode instead.

'Schutzt\xc3\xbcren'.decode("utf-8")

10 Comments

It worked. Can you explain how you came to this solution?
'Schutzt\xc3\xbcren'.decode("utf-8") results in u'Schutzt\xfcren' - so not working for me.
To explain in a bit more detail: UTF-8 is an encoding for storing unicode characters as bytes. Each unicode character has a numeric code point, and UTF-8 represents it with one to four bytes. ASCII characters take a single byte, exactly as they always did, while 'ü' (U+00FC) takes two bytes, 0xC3 0xBC. Decoding combines those bytes back into the single unicode character.
@Make42: it works. u'\xfc' is the unicode code point for 'ü', or in unicode notation U+00FC. What you see is only the repr of the string. print 'Schutzt\xc3\xbcren'.decode("utf-8") should give correct output (if your terminal is correctly configured).
@SergeBallesta Could you please also let me know how to configure the terminal for this? I have been facing this issue for quite a while, too.
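To make the point from these comments concrete, here is a small sketch comparing the decoded string with its UTF-8 bytes. The variable names are illustrative only, and the sample characters in the last step ('a', 'ü', '€') were picked just to show one-, two- and three-byte encodings.

```python
# Decode the UTF-8 bytes from the question into a unicode string.
text = b'Schutzt\xc3\xbcren'.decode("utf-8")

# The character at index 7 is U+00FC; u'\xfc' is just how repr() spells it.
code_point = ord(text[7])

# One character, but two bytes once encoded as UTF-8.
n_chars = len(text)
n_bytes = len(text.encode("utf-8"))

# UTF-8 is variable-length: byte counts for 'a', u-umlaut and the euro sign.
widths = [len(ch.encode("utf-8")) for ch in (u'a', u'\xfc', u'\u20ac')]
```

So u'Schutzt\xfcren' is the expected result: the escape is only the repr notation for the single character 'ü', which print renders normally when the terminal encoding allows it.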
0

In Python 3 you'd need to decode the bytes that are your encoded string:

b'Schutzt\xc3\xbcren'.decode("utf-8")

In Python 2 the b prefix is not necessary (there the distinction between bytes and strings is less strict...).
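As a hedged illustration of that difference: the b prefix is accepted by Python 2.6+ as well (where it changes nothing), so the same line can be written once for both versions. The variable names below are made up for the sketch.

```python
import sys

raw = b'Schutzt\xc3\xbcren'      # bytes on Python 3, plain str on Python 2
text = raw.decode("utf-8")       # unicode on Python 2, str on Python 3

on_python2 = sys.version_info[0] == 2

# On Python 2, bytes is merely an alias for str; on Python 3 they differ.
bytes_is_str = bytes is str

# Name of the text type produced by decode() on this interpreter.
text_type_name = type(text).__name__
```

Writing the b prefix even on Python 2 documents the intent and keeps the snippet valid if the code base is later moved to Python 3.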
