Writing unicode characters to file in Python

Question

Take the following sample data (whole list can be found here):

Ω≈ç√∫˜µ≤≥÷
åß∂ƒ©˙∆˚¬…æ
œ∑´®†¥¨ˆøπ“‘
¡™£¢∞§¶•ªº–≠
¸˛Ç◊ı˜Â¯˘¿
ÅÍÎÏ˝ÓÔÒÚÆ☃
Œ„´‰ˇÁ¨ˆØ∏”’
ヽ༼ຈل͜ຈ༽ﾉ ヽ༼ຈل͜ຈ༽ﾉ 
(｡◕ ∀ ◕｡)
｀ｨ(´∀｀∩
_   _ﾛ(,_,*)
・(￣∀￣)・:*:

I have been writing the data from the aforementioned dump of strings to separate HTML files (the details are irrelevant to the question) like so:

for value in tags['tags']:
    for line in data:
        with open('./output/fuzzml' + str(file_count), 'w') as output:
            parsed_string = value.replace('[[VAR]]', u''.join(line.rstrip()))
            output.write(parsed_string)
            file_count += 1

This works nicely for a relatively small portion of the data dump, until it comes across some of the tricky symbols like the ones above. I have modified the u''.join(line.rstrip()) expression multiple times in the hope of finding a way to write everything out correctly. However, it always gets stuck at some point and raises a UnicodeDecodeError exception:

Traceback (most recent call last):
File "generate-html.py", line 37, in <module>
  main()
File "generate-html.py", line 34, in main
  generate_html(tag_file, data_file)
File "generate-html.py", line 18, in generate_html
  parsed_string = value.replace('[[VAR]]', u''.join(line.rstrip()))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)

The tags are extracted from a JSON file with the following sample set:

"tags": [
          "<img src=\"[[VAR]]\">",
          "<a href=\"[[VAR]]\"><img src=\"[[VAR]]\">",
          "<script>[[VAR]]</script>",
          "<[[VAR]]>Hello World<[[VAR]]>"
   ]

data is just the raw strings from the above link/sample data.

Just comes from a JSON file, I'll append that in the question in a moment.
– Juxhin, Commented Aug 27, 2015 at 10:25
And what type are the objects in data? Is each line a unicode object or a str bytestring?
– Martijn Pieters, Commented Aug 27, 2015 at 10:25
@MartijnPieters - ought to be str bytestring but I will double-check and let you know.
– Juxhin, Commented Aug 27, 2015 at 10:27
I'm not sure why you are using u''.join() on a string object here. That is just keeping Python busy for no positive effect whatsoever. ''.join() will convert the string to a list of individual characters, then rejoin those again to one string. But by using u'', you also force an implicit conversion to a unicode string.
– Martijn Pieters, Commented Aug 27, 2015 at 10:32

Martijn Pieters · Accepted Answer · 2015-08-27 10:53:23Z

1

At issue is your use of u''.join() here:

u''.join(line.rstrip())

This is pretty useless; it is breaking up the string into individual characters, then rejoining those back into a unicode string again. You were probably aiming for the side-effect of this: implicit conversion to a unicode string.

You could get the same effect with:

unicode(line.rstrip())

which will fail with the exact same error, because neither version tells Python what codec was used for the bytestring to encode your characters.

Decode your lines explicitly; the file you linked to is encoded as UTF-8:

unicode(line.rstrip(), 'utf-8')

or

line.rstrip().decode('utf-8')
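As a rough illustration of the difference, here is a minimal sketch that assumes one line of the data file, hard-coded as UTF-8 bytes (the names raw_line, ascii_error and text are made up for this example). In Python 2 the implicit conversion falls back to the default ASCII codec, which behaves like the explicit .decode('ascii') call shown here:

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: the UTF-8 bytes for the first three characters of the
# sample data (Omega, almost-equal sign, c with cedilla) plus a newline.
raw_line = b'\xce\xa9\xe2\x89\x88\xc3\xa7\n'

# Decoding with the ASCII codec fails on the very first byte (0xce).
try:
    raw_line.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Naming the codec that was actually used to encode the file works.
text = raw_line.rstrip().decode('utf-8')

print(ascii_error)  # the message is plain ASCII, so it is safe to print
print(len(text))    # number of decoded characters, not bytes
```

The message printed for ascii_error should match the one in the question's traceback, since the failing byte and position are the same.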

The next problem is that your parsed_string object is now a Unicode object too, so you'll need to encode it again when writing to a file:

output.write(parsed_string.encode('utf8'))

or use the io.open() function to open a file object that encodes Unicode strings for you as you write.
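A minimal sketch of the io.open() variant, assuming a single hard-coded tag and data line and a temporary output directory (tag, raw_line, out_dir and out_path are illustrative names, not taken from the original script):

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical stand-ins for one entry from the JSON tags and one data line.
tag = u'<img src="[[VAR]]">'
raw_line = b'\xce\xa9\xe2\x89\x88\xc3\xa7\n'  # UTF-8 bytes read from the data file

# Decode the bytes first, so the replacement only ever mixes Unicode strings.
parsed_string = tag.replace(u'[[VAR]]', raw_line.rstrip().decode('utf-8'))

# Write to a temporary directory rather than ./output for this sketch.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'fuzzml0')

# io.open() in text mode encodes Unicode strings as they are written.
with io.open(out_path, 'w', encoding='utf-8') as output:
    output.write(parsed_string)

# Read the file back in binary mode to inspect the bytes that were stored.
with io.open(out_path, 'rb') as check:
    written_bytes = check.read()
```

With this approach the explicit .encode('utf8') call is no longer needed at each write. On Python 3 the built-in open() is the same function as io.open().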

You may want to read:

Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO

before continuing to fully understand how Python and Unicode work together.

edited Aug 27, 2015 at 10:53

answered Aug 27, 2015 at 10:47

Martijn Pieters


4 Comments

Juxhin Over a year ago

So after importing the data file into notepad++ it seems it should be UTF-8 w/o BOM and unicode(line.rstrip(), 'utf-8') works for say ╬®Ôëê├ºÔêÜÔê½╦£┬ÁÔëñÔëÑ├À but still chokes on the rest further down the file.

Martijn Pieters Over a year ago

@Juxhin: it decodes your file just fine. You probably have other problems elsewhere.

Juxhin Over a year ago

Actually it seems that notepad++ isn't decoding all the file properly, sublime actually displays all of the characters just fine. I'll just read a bit more about it.

Juxhin Over a year ago

You summed it up nicely in the end. output.write(parsed_string.encode('utf8')) was required as we now have a Unicode object. Works great. Thanks Martijn.
