
Take the following sample data (whole list can be found here):

Ω≈ç√∫˜µ≤≥÷
åß∂ƒ©˙∆˚¬…æ
œ∑´®†¥¨ˆøπ“‘
¡™£¢∞§¶•ªº–≠
¸˛Ç◊ı˜Â¯˘¿
ÅÍÎÏ˝ÓÔÒÚÆ☃
Œ„´‰ˇÁ¨ˆØ∏”’
ヽ༼ຈل͜ຈ༽ノ ヽ༼ຈل͜ຈ༽ノ 
(。◕ ∀ ◕。)
`ィ(´∀`∩
_   _ロ(,_,*)
・( ̄∀ ̄)・:*:

I have been writing the data from the aforementioned dump of strings to separate HTML files (the details are irrelevant to the question) like so:

for value in tags['tags']:
    for line in data:
        with open('./output/fuzzml' + str(file_count), 'w') as output:
            parsed_string = value.replace('[[VAR]]', u''.join(line.rstrip()))
            output.write(parsed_string)
            file_count += 1

This works nicely for a relatively small portion of the data dump, until it comes across some of the tricky symbols like the ones above. I have modified the replacement line (u''.join(line.rstrip())) multiple times in the hope of finding a form that writes everything correctly, but it always gets stuck at some point and raises a UnicodeDecodeError exception:

Traceback (most recent call last):
File "generate-html.py", line 37, in <module>
  main()
File "generate-html.py", line 34, in main
  generate_html(tag_file, data_file)
File "generate-html.py", line 18, in generate_html
  parsed_string = value.replace('[[VAR]]', u''.join(line.rstrip()))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)

The tags are extracted from a JSON file with the following sample set:

"tags": [
          "<img src=\"[[VAR]]\">",
          "<a href=\"[[VAR]]\"><img src=\"[[VAR]]\">",
          "<script>[[VAR]]</script>",
          "<[[VAR]]>Hello World<[[VAR]]>"
   ]

data is just the raw strings from the above link/sample data.

  • What type is value here? Is it a bytestring perhaps? Commented Aug 27, 2015 at 10:24
  • Just comes from a JSON file, I'll append that in the question in a moment. Commented Aug 27, 2015 at 10:25
  • And what type are the objects in data? Is each line a unicode object or a str bytestring? Commented Aug 27, 2015 at 10:25
  • @MartijnPieters - ought to be str bytestring but I will double-check and let you know. Commented Aug 27, 2015 at 10:27
  • I'm not sure why you are using u''.join() on a string object here. That is just keeping Python busy for no positive effect whatsoever. ''.join() will convert the string to a list of individual characters, then rejoin those again into one string. But by using u'', you also force an implicit conversion to a unicode string. Commented Aug 27, 2015 at 10:32

1 Answer

At issue is your use of u''.join() here:

u''.join(line.rstrip())

This is pretty useless; it is breaking up the string into individual characters, then rejoining those back into a unicode string again. You were probably aiming for the side-effect of this: implicit conversion to a unicode string.

You could get the same effect with:

unicode(line.rstrip())

which will fail with the exact same error, because neither version tells Python what codec was used for the bytestring to encode your characters.
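To see where the error comes from, the implicit conversion can be written out as an explicit ASCII decode. This is a minimal sketch, assuming the line starts with the UTF-8 bytes 0xce 0xa9 (the first sample character), which matches the byte reported in the traceback; raw_line and implicit_decode are hypothetical names used only for illustration:

```python
# Hypothetical stand-in for one line read from the data file:
# the UTF-8 bytes of the first sample character, plus a newline.
raw_line = b'\xce\xa9\n'


def implicit_decode(bytestring):
    """Spell out what Python 2 does implicitly when a str meets a unicode
    object: decode the bytes with the default ASCII codec."""
    return bytestring.decode('ascii')


try:
    implicit_decode(raw_line.rstrip())
    error_message = None
except UnicodeDecodeError as exc:
    # 0xce is outside the ASCII range, so decoding fails on the first byte.
    error_message = str(exc)

print(error_message)
```

The failure does not depend on join() or unicode() as such; any operation that decodes these bytes as ASCII hits the same wall.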

Decode your lines explicitly; the file you linked to is encoded as UTF-8:

unicode(line.rstrip(), 'utf-8')

or

line.rstrip().decode('utf-8')
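As a rough illustration of the explicit decode, the sketch below decodes one line and substitutes it into a tag template. The byte string and the tag value are assumptions standing in for a line of your data file and an entry from your JSON file:

```python
# Hypothetical line from the data file: UTF-8 bytes for the first three
# sample characters, followed by a newline.
raw_line = b'\xce\xa9\xe2\x89\x88\xc3\xa7\n'

# Decode explicitly, naming the codec, instead of relying on the ASCII default.
text = raw_line.rstrip().decode('utf-8')

# Assumed tag template, written as a Unicode string as the JSON parser would
# typically return it.
tag = u'<img src="[[VAR]]">'
parsed_string = tag.replace(u'[[VAR]]', text)

print(len(text))  # number of characters, not bytes
```

Both sides of replace() are now Unicode strings, so no implicit decoding takes place.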

The next problem is that your parsed_string object is now a Unicode object too, so you'll need to encode that again when writing to a file:

output.write(parsed_string.encode('utf8'))

or use the io.open() function to open a file object that encodes Unicode strings for you as you write.
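A minimal sketch of the io.open() route follows. The temporary directory stands in for your ./output folder, and parsed_string is a hypothetical, already-decoded value; neither is taken from your script:

```python
import io
import os
import tempfile

# Hypothetical Unicode result of the tag substitution.
parsed_string = u'<script>\u03a9\u2248\u00e7</script>'

# Temporary directory used here in place of ./output.
output_dir = tempfile.mkdtemp()
output_path = os.path.join(output_dir, 'fuzzml0')

# io.open() in text mode encodes Unicode strings on write using the
# given encoding, so no manual .encode() call is needed.
with io.open(output_path, 'w', encoding='utf-8') as output:
    output.write(parsed_string)

# Read the file back as raw bytes to confirm what ended up on disk.
with io.open(output_path, 'rb') as check:
    written_bytes = check.read()
```

Note that a file opened this way expects Unicode strings; in Python 2, writing a plain str bytestring to it raises a TypeError.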

You may want to read up on how Python and Unicode work together before continuing.


4 Comments

So after importing the data file into notepad++ it seems it should be UTF-8 w/o BOM and unicode(line.rstrip(), 'utf-8') works for say ╬®Ôëê├ºÔêÜÔê½╦£┬ÁÔëñÔëÑ├À but still chokes on the rest further down the file.
@Juxhin: it decodes your file just fine. You probably have other problems elsewhere.
Actually it seems that notepad++ isn't decoding all the file properly, sublime actually displays all of the characters just fine. I'll just read a bit more about it.
You summed it up nicely in the end. output.write(parsed_string.encode('utf8')) was required as we now have a Unicode object. Works great. Thanks Martijn.
