
The code below (Python 3.6) takes a bytes object that represents the multiplication sign in UTF-8 (b'\xc3\x97'), decodes it to a string, and writes the string to a file:

# Byte sequence corresponds to multiplication sign in UTF-8
myBytes = b'\xc3\x97'
# Decode to string 
myString = myBytes.decode('utf-8')

# Write myString to file
with open("myString.txt", "w") as ms_file:
    ms_file.write(myString)

This gives me the following result:

Bytes written to myString.txt (checked by opening the file in a hex editor): D7

The result I expected here was the 2-byte sequence C3 97, which is the UTF-8 representation of the multiplication sign. Moreover, D7 on its own is not even a valid one-byte UTF-8 sequence (see also UTF-8 Codepage Layout). It does, however, match the byte value for this character in ISO/IEC 8859-1 (Latin-1).
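To make the byte values above concrete, here is a small check that encodes the same character in several encodings. It is only an illustrative sketch: cp1252 is included on the assumption that it is the locale encoding of a Western-European Windows install, which the question does not confirm, and the variable names are made up for this example.

```python
# U+00D7 MULTIPLICATION SIGN, obtained the same way as in the question
mult_sign = b'\xc3\x97'.decode('utf-8')

# Encode the one-character string in three different encodings
utf8_bytes = mult_sign.encode('utf-8')      # two bytes
latin1_bytes = mult_sign.encode('latin-1')  # one byte
cp1252_bytes = mult_sign.encode('cp1252')   # one byte (assumed Windows locale encoding)

print(utf8_bytes.hex(), latin1_bytes.hex(), cp1252_bytes.hex())

# A lone D7 byte is a UTF-8 lead byte with no continuation byte,
# so strict decoding rejects it
try:
    b'\xd7'.decode('utf-8')
    d7_is_valid_utf8 = True
except UnicodeDecodeError:
    d7_is_valid_utf8 = False

print(d7_is_valid_utf8)
```

If the file was written with a single-byte encoding such as these, a lone D7 byte is exactly what one would expect to see in the hex editor.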

So my question is simply how I can ensure that I end up with valid UTF-8 here. Am I overlooking something really obvious, or is this a bug in Python?

Some context: I ran into this issue while writing code that processes XML files (which use UTF-8). The code parses the XML to an Element object with lxml and extracts the text values of some elements, which are then written to another XML file (also UTF-8). Because of this issue, I can end up with XML files that are not well-formed.

I'm using Python 3.6 under Windows 7.

EDIT: the original question/code contained a function that was supposed to print a hex representation of myString to the screen, but it turned out not to behave as expected. Since it made things unnecessarily confusing (and was not essential to the question), I removed it from the code.

  • What's your sys.getdefaultencoding() say? Note that your strAsHex() displays unicode codepoints returned by ord(). Commented Apr 6, 2017 at 13:08
  • Also see this stackoverflow.com/questions/27452317/… Commented Apr 6, 2017 at 13:15
  • @Ilja Thanks, my default encoding is utf-8. I wasn't 100% sure about strAsHex either, which is why I double-checked the result by writing the string to a file (which also gives me a single D7 byte). Commented Apr 6, 2017 at 13:16
  • "In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.", also mentioned in the linked Q/A. Did not remember this myself either, so thank you for this reminder. Easiest way out is to pass encoding='utf-8' explicitly to open(). Commented Apr 6, 2017 at 13:18
  • @Ilja Brilliant, adding the encoding='utf-8' did the trick for me (also works in my full XML processing application). Thank you so much for suggesting this, this was really doing my head in! Commented Apr 6, 2017 at 13:35
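
The fix suggested in the comments can be sketched as follows. This is a minimal illustration, not the asker's final code: it writes to a scratch directory created with tempfile (a hypothetical location chosen for this example) and reads the file back in binary mode to check the bytes, in place of the hex editor.

```python
import locale
import os
import tempfile

# Byte sequence corresponds to multiplication sign in UTF-8
myBytes = b'\xc3\x97'
myString = myBytes.decode('utf-8')

# Encoding that open() falls back to in text mode when none is given;
# the value depends on the platform and locale
print(locale.getpreferredencoding(False))

# Hypothetical scratch location for this example
out_path = os.path.join(tempfile.mkdtemp(), "myString.txt")

# Pass the encoding explicitly instead of relying on the locale default
with open(out_path, "w", encoding="utf-8") as ms_file:
    ms_file.write(myString)

# Read the raw bytes back to verify what was written
with open(out_path, "rb") as check_file:
    written = check_file.read()

print(written.hex())
```

With the explicit encoding, the bytes on disk should be C3 97 regardless of the locale encoding reported on the machine.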
