Take the following sample data (whole list can be found here):
Ω≈ç√∫˜µ≤≥÷
åß∂ƒ©˙∆˚¬…æ
œ∑´®†¥¨ˆøπ“‘
¡™£¢∞§¶•ªº–≠
¸˛Ç◊ı˜Â¯˘¿
ÅÍÎÏ˝ÓÔÒÚÆ☃
Œ„´‰ˇÁ¨ˆØ∏”’
ヽ༼ຈل͜ຈ༽ノ ヽ༼ຈل͜ຈ༽ノ
(。◕ ∀ ◕。)
`ィ(´∀`∩
_ _ロ(,_,*)
・( ̄∀ ̄)・:*:
I have been outputting the data from the aforementioned dump of strings to separate HTML files (there is no need to go into detail, as this is irrelevant to the question) like so:
for value in tags['tags']:
    for line in data:
        with open('./output/fuzzml' + str(file_count), 'w') as output:
            parsed_string = value.replace('[[VAR]]', u''.join(line.rstrip()))
            output.write(parsed_string)
        file_count += 1
This works nicely for a relatively small portion of the data dump, until it comes across some of the tricky symbols like the ones above. I have modified the u''.join(line.rstrip()) expression multiple times in the hope of writing it in a way that outputs everything correctly, but it always gets stuck at some point and raises a UnicodeDecodeError exception:
Traceback (most recent call last):
  File "generate-html.py", line 37, in <module>
    main()
  File "generate-html.py", line 34, in main
    generate_html(tag_file, data_file)
  File "generate-html.py", line 18, in generate_html
    parsed_string = value.replace('[[VAR]]', u''.join(line.rstrip()))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)
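The byte 0xce in that message is consistent with the first sample line: if the data file is UTF-8 encoded (an assumption, it is not stated above), the character U+03A9 is stored as the two bytes 0xCE 0xA9. A minimal sketch that does explicitly what Python 2 does implicitly when a str bytestring is combined with a unicode string, and so reproduces the same message:

```python
# Assumption: the data file is UTF-8, so the first sample line begins with
# the bytes 0xCE 0xA9 (U+03A9) followed by 0xE2 0x89 0x88 (U+2248).
raw = b'\xce\xa9\xe2\x89\x88'

try:
    # Python 2's implicit str -> unicode conversion amounts to this call.
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Decoding with the codec the bytes were actually written in succeeds.
text = raw.decode('utf-8')
print(len(text))
```

The same bytes decode cleanly once the codec matches, which suggests the data is fine and only the implicit ASCII step is at fault.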
The tags are extracted from a JSON file with the following sample set:
"tags": [
    "<img src=\"[[VAR]]\">",
    "<a href=\"[[VAR]]\"><img src=\"[[VAR]]\">",
    "<script>[[VAR]]</script>",
    "<[[VAR]]>Hello World<[[VAR]]>"
]
data is just the raw strings from the above link/sample data.
What is value here? Is it a bytestring perhaps?
And data? Is each line a unicode object or a str bytestring?
A str bytestring, but I will double-check and let you know.
There is no point in calling u''.join() on a string object here. That is just keeping Python busy for no positive effect whatsoever.
''.join() will convert the string to a list of individual characters, then rejoin those again to one string. But by using u'', you also force an implicit conversion to a unicode string.
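One way to sidestep the implicit conversion described in these comments is to decode the data when reading it and encode it when writing it, so that only text is ever combined with text. The following is a sketch under assumptions, not the asker's actual script: it assumes the data file is UTF-8, the render helper and the temporary paths are hypothetical, and the two data lines are stand-ins modelled on the sample. io.open accepts an encoding argument in both Python 2 and Python 3.

```python
import io
import os
import tempfile


def render(template, line):
    """Substitute one data line into a tag template; both must be text."""
    return template.replace(u'[[VAR]]', line.rstrip())


# Hypothetical stand-ins for the parsed JSON tags and the data file.
tags = {'tags': [u'<img src="[[VAR]]">', u'<script>[[VAR]]</script>']}
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, 'data.txt')

# Create a small UTF-8 data file so the sketch runs on its own.
with io.open(data_path, 'w', encoding='utf-8') as handle:
    handle.write(u'\u03a9\u2248\u00e7\n(\uff61\u25d5 \u2200 \u25d5\uff61)\n')

# Decode on the way in: io.open with an encoding yields text lines.
with io.open(data_path, encoding='utf-8') as handle:
    data = handle.readlines()

file_count = 0
for value in tags['tags']:
    for line in data:
        out_path = os.path.join(workdir, 'fuzzml' + str(file_count))
        # Encode on the way out: the file object performs the UTF-8 encoding.
        with io.open(out_path, 'w', encoding='utf-8') as output:
            output.write(render(value, line))
        file_count += 1
```

With both the template and the line already text, no ASCII decode is attempted, and the u''.join() call is no longer needed.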