The code below (Python 3.6) takes a bytes object that represents the multiplication sign in UTF-8 (b'\xc3\x97'), decodes it to a string, and writes the string to a file:
# Byte sequence corresponds to multiplication sign in UTF-8
myBytes = b'\xc3\x97'
# Decode to string
myString = myBytes.decode('utf-8')
# Write myString to file
with open("myString.txt", "w") as ms_file:
    ms_file.write(myString)
This gives me the following result:
Bytes written to myString.txt (checked by opening the file in a hex editor): D7
The result I expected here was the 2-byte sequence C3 97, which is the UTF-8 representation of the multiplication sign. Moreover, D7 on its own is not even a valid (one-byte) UTF-8 sequence (see also UTF-8 Codepage Layout). It is, however, the byte value of the multiplication sign in the ISO/IEC 8859-1 (Latin-1) encoding.
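As a quick sanity check of the byte values mentioned above, here is a minimal sketch. The variable names are only for illustration, and cp1252 is included on the assumption that it may be the default text encoding on a Windows machine like this one, which the post itself does not confirm.

```python
# Multiplication sign, U+00D7
mult_sign = '\u00d7'

# Encode the same character with three different codecs
utf8_bytes = mult_sign.encode('utf-8')      # two bytes in UTF-8
latin1_bytes = mult_sign.encode('latin-1')  # one byte in ISO/IEC 8859-1
cp1252_bytes = mult_sign.encode('cp1252')   # one byte in Windows-1252 (assumed relevant here)

print(utf8_bytes.hex(), latin1_bytes.hex(), cp1252_bytes.hex())

# A lone D7 byte cannot be decoded as UTF-8
try:
    b'\xd7'.decode('utf-8')
    lone_d7_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_d7_is_valid_utf8 = False

print(lone_d7_is_valid_utf8)
```

If the single-byte codecs both give D7 while UTF-8 gives C3 97, that would suggest the file was written with a single-byte encoding rather than UTF-8.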
So my question is simply how I can ensure that I end up with valid UTF-8 here. Am I overlooking something really obvious, or is this a bug in Python?
Some context: I ran into this issue while writing some code that processes XML files (which use UTF-8). The code parses the XML to an Element object with lxml and extracts the text values of some elements, which are subsequently written to another XML file (which also uses UTF-8). Due to this issue I can now end up with XML files that are not well-formed.
I'm using Python 3.6 under Windows 7.
EDIT: original question/code contained a function that was supposed to print a hex representation of myString to the screen, but as it turns out it was not behaving as expected. Since this made things unnecessarily confusing (also the function was not essential to the question) I removed it from the code.
What does sys.getdefaultencoding() say? Note that your strAsHex() displays Unicode code points returned by ord().
It says utf-8. Wasn't 100% sure about the strAsHex either, which is why I double-checked the result by writing the string to file (which also gives me one single D7 byte).
Try passing encoding='utf-8' explicitly to open().
Passing encoding='utf-8' did the trick for me (also works in my full XML processing application). Thank you so much for suggesting this, this was really doing my head in!
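For reference, the fix discussed in the comments could look roughly like the sketch below. It assumes that the cause is open() in text mode falling back to the locale's preferred encoding when no encoding is given. Writing to the temp directory and reading the bytes back are additions for illustration only, not part of the original code.

```python
import locale
import os
import tempfile

# Byte sequence corresponds to multiplication sign in UTF-8
my_bytes = b'\xc3\x97'
# Decode to string
my_string = my_bytes.decode('utf-8')

# Encoding that open() uses in text mode when none is passed explicitly;
# this can differ from sys.getdefaultencoding()
print(locale.getpreferredencoding(False))

# Hypothetical output location, chosen only to keep the example self-contained
out_path = os.path.join(tempfile.gettempdir(), "myString.txt")

# Write the string with an explicit encoding
with open(out_path, "w", encoding="utf-8") as ms_file:
    ms_file.write(my_string)

# Read the raw bytes back to check what actually ended up in the file
with open(out_path, "rb") as ms_file:
    written = ms_file.read()

print(written.hex())
```

Passing the encoding explicitly makes the output independent of the machine's locale settings, which seems the safer choice for files that are meant to be UTF-8 XML.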