Decoding of bytes object results in unexpected + invalid UTF-8 - how can I avoid this?


Asked 8 years, 8 months ago

Modified 8 years, 8 months ago

Viewed 653 times

The code below (Python 3.6) takes a bytes object that represents the multiplication sign in UTF-8 (b'\xc3\x97'), decodes it to a string, and writes the string to a file:

# Byte sequence corresponds to multiplication sign in UTF-8
myBytes = b'\xc3\x97'
# Decode to string 
myString = myBytes.decode('utf-8')

# Write myString to file
with open("myString.txt", "w") as ms_file:
    ms_file.write(myString)

This gives me the following result:

Bytes written to myString.txt (checked by opening the file in a hex editor): D7

The result I expected here was the 2-byte sequence C3 97, which is the UTF-8 representation of the multiplication sign. Moreover, D7 on its own is not even a valid (one-byte) UTF-8 sequence (see also UTF-8 Codepage Layout). It does, however, match the byte value for this character in the ISO/IEC 8859-1 (Latin-1) encoding.

So my question is simply how I can ensure that I end up with valid UTF-8 here. Am I overlooking something really obvious, or is this a bug in Python?

Some context: I ran into this issue while writing code that processes XML files (which use UTF-8). The code parses the XML into an Element object with lxml, extracts the text values of some elements, and writes them to another XML file (which also uses UTF-8). Because of this issue, I can end up with XML files that are not well-formed.

I'm using Python 3.6 under Windows 7.

EDIT: original question/code contained a function that was supposed to print a hex representation of myString to the screen, but as it turns out it was not behaving as expected. Since this made things unnecessarily confusing (also the function was not essential to the question) I removed it from the code.

edited Apr 6, 2017 at 14:14

asked Apr 6, 2017 at 13:04

johan


What does your sys.getdefaultencoding() say? Note that your strAsHex() displays the Unicode code points returned by ord().

– Ilja Everilä, Apr 6, 2017 at 13:08
Also see this stackoverflow.com/questions/27452317/…

– Ilja Everilä, Apr 6, 2017 at 13:15
@Ilja Thanks, my default encoding is utf-8. I wasn't 100% sure about strAsHex either, which is why I double-checked the result by writing the string to a file (which also gives me a single D7 byte).

– johan, Apr 6, 2017 at 13:16

"In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding." This is also mentioned in the linked Q/A. I did not remember this myself either, so thank you for the reminder. The easiest way out is to pass encoding='utf-8' explicitly to open().

– Ilja Everilä, Apr 6, 2017 at 13:18
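To make the distinction in the comment above concrete, here is a minimal sketch. It compares sys.getdefaultencoding() with the locale encoding that open() falls back to, and encodes the multiplication sign three ways. That the asker's Windows 7 locale encoding was cp1252 is an assumption (it is never reported in the thread); the variable names are illustrative only.

```python
import locale
import sys

# U+00D7 MULTIPLICATION SIGN, built from the bytes used in the question
mult_sign = b'\xc3\x97'.decode('utf-8')

# Default used by str.encode() / bytes.decode(); 'utf-8' on Python 3
print(sys.getdefaultencoding())

# What open() consults in text mode when no encoding is passed;
# the value printed here depends on the platform and its settings
print(locale.getpreferredencoding(False))

# The same character encoded with different codecs
utf8_bytes = mult_sign.encode('utf-8')      # two bytes
cp1252_bytes = mult_sign.encode('cp1252')   # assumed Windows locale codec
latin1_bytes = mult_sign.encode('latin-1')  # ISO/IEC 8859-1

print(utf8_bytes.hex(), cp1252_bytes.hex(), latin1_bytes.hex())
```

If the locale encoding is a single-byte Western code page, a lone D7 byte in the file is what one would expect: decode() worked correctly, and the re-encoding on write is where the bytes changed.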
@Ilja Brilliant, adding the encoding='utf-8' did the trick for me (also works in my full XML processing application). Thank you so much for suggesting this, this was really doing my head in!

– johan, Apr 6, 2017 at 13:35
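For reference, the fix confirmed in the comments can be sketched as follows. This is the question's code with encoding="utf-8" passed to open(), plus a binary read-back in place of the hex editor check. The temporary directory and the names path and written are additions for illustration, not part of the original code.

```python
import os
import tempfile

# Byte sequence corresponds to multiplication sign in UTF-8
my_bytes = b'\xc3\x97'
my_string = my_bytes.decode('utf-8')

# Hypothetical scratch location so the sketch does not touch real files
path = os.path.join(tempfile.mkdtemp(), "myString.txt")

# Explicit encoding: the result no longer depends on the locale setting
with open(path, "w", encoding="utf-8") as ms_file:
    ms_file.write(my_string)

# Read the file back in binary mode to inspect the raw bytes on disk
with open(path, "rb") as raw_file:
    written = raw_file.read()

print(written.hex())
```

Reading in "rb" mode bypasses text decoding entirely, so the check shows exactly what a hex editor would show.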
