I have some text that is part of an HTML page, and I would like to save it to a file.

This works fine in debug mode in Eclipse, but fails at runtime when run from the shell. Here is a short HTML example that fails:

xx = '<input type="hidden" name="charset_test" value="€,´,€,´,水,Д,Є" />'
with codecs.open('myfile.htm', 'wb', encoding="utf-8") as output:
    output.write(xx)

and I get:

 Exception 'ascii' codec can't decode byte 0xe2 in position XXX: ordinal not in range(128)

where XXX is the position in the relevant file of the "strange" symbols, such as the EURO sign.

Why does this work from Eclipse but not from the shell? How do I solve this?

I tried

HTMLParser.HTMLParser().unescape()
unquote()
unicode()

Nothing worked...

  • Is xx an actual variable in the code or just a fragment of the file you're giving as the example? Commented Apr 29, 2013 at 13:56
  • It's just a fragment. I zoomed in on it using the XXX location, since the original is a very big file. Commented Apr 29, 2013 at 17:12
  • Haven't you solved the problem yet? What is the encoding of the file you're trying to process? Commented Apr 29, 2013 at 17:25
  • it's not a file, it's a result of a url call to a remote file. Commented Apr 29, 2013 at 17:30
  • It is in the response headers. If you have Firefox and Firebug, for example, you can see it in the Net tab, which lists the GET requests sent and their headers. The charset is defined in the Content-Type header, which for this page is Content-Type: text/html; charset=utf-8 Commented Apr 30, 2013 at 12:39

1 Answer

The following code works for me...

# coding=utf-8

import codecs

data = '<input type="hidden" name="charset_test" value="€,´,€,´,水,Д,Є" />'
with codecs.open('myfile.htm', 'wb', encoding="utf-8") as output:
    output.write(data.decode('utf-8'))

...but if the source data is already UTF-8 encoded, and you also want to write UTF-8 data, there's no need to decode it to a Python unicode object, then re-encode back to UTF-8. You can just do...

# coding=utf-8

data = '<input type="hidden" name="charset_test" value="€,´,€,´,水,Д,Є" />'
with open('myfile.htm', 'wb') as output:
    output.write(data)
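
As for why the original attempt fails: in Python 2, passing a byte string to a codecs writer makes Python decode it implicitly with the default codec (normally ASCII) before encoding it to UTF-8, and that implicit decode is what raises the error on the first byte of the euro sign. The following is only a minimal sketch of that mechanism, written so it also runs on Python 3. The names raw, error_message, out_path and round_trip are made up for illustration, the temporary directory is just to keep the example self-contained, and it assumes the remote data really is UTF-8, as the Content-Type header mentioned in the comments suggests.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# UTF-8 bytes, standing in for what the remote server returns
# (assumption: the response is declared as charset=utf-8).
raw = u'<input type="hidden" name="charset_test" value="€,´,水,Д,Є" />'.encode('utf-8')

# Decoding the bytes as ASCII reproduces the reported failure.
try:
    raw.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decode explicitly with the declared charset, then let the file object
# encode the text back to UTF-8 when writing.
text = raw.decode('utf-8')
out_path = os.path.join(tempfile.mkdtemp(), 'myfile.htm')
with io.open(out_path, 'w', encoding='utf-8') as output:
    output.write(text)

# Read the file back as raw bytes to compare with the input.
with io.open(out_path, 'rb') as check:
    round_trip = check.read()
```

If round_trip equals raw, the decode and re-encode steps cancelled out, which is why writing the bytes directly in binary mode, as in the second snippet above, gives the same file.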