
Suppose I have strings with lots of stuff like

&#226;&#128;&#156;words words words

Is there a way to convert these through python directly into the characters they represent?

I tried

h = HTMLParser.HTMLParser()
print h.unescape(x)

but got this error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

I also tried

print h.unescape(x).encode('utf-8')

but it encodes

&#226;&#128;&#156; as â

when it should be a quote

  • what makes you think &#226;&#128;&#156; should be a comma? what webpage is this coming from? to convert them to the characters they represent h.unescape(x) does that ... but when you try and print it there are problems ... try looking at its repr Commented Jun 24, 2014 at 20:19
  • i said quote not comma. from the context it is clear it is a quote because these appear at the beginning and end of a string that should have quotes. also this page shows this in the "As a string of HTML entities:" part software.hixie.ch/utilities/cgi/unicode-decoder/… Commented Jun 24, 2014 at 20:26
  • my mistake ... ok that gives me more to work with hold on Commented Jun 24, 2014 at 20:30
  • Are you generating the NCR escapes, or are they coming from an external source? If from an external source, are they present in the external source? As others have indicated, you have a quotation mark where the bytes are being escaped rather than the character being escaped. There is a mis-encoding somewhere in the pipeline. The first step is to identify where in the pipeline this error is occurring. Commented Aug 4, 2023 at 22:48

2 Answers

2

&#226;&#128;&#156; form a UTF-8 byte sequence for the U+201C LEFT DOUBLE QUOTATION MARK character. Something is majorly mucked up there. The correct encoding would have been &ldquo;.
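To see why the three references 226, 128 and 156 add up to a single quotation mark, here is a small sketch (written for Python 3, which is an assumption; the thread itself uses Python 2). It encodes U+201C to UTF-8 and then escapes each byte separately, which reproduces the broken input from the question. The variable names are illustrative only.

```python
# U+201C LEFT DOUBLE QUOTATION MARK as a Python 3 string.
quote = "\u201c"

# Encoding to UTF-8 yields three bytes.
utf8_bytes = quote.encode("utf-8")
print(list(utf8_bytes))  # [226, 128, 156]

# Escaping each byte (instead of the character) gives the broken form.
broken = "".join("&#%d;" % b for b in utf8_bytes)
print(broken)  # &#226;&#128;&#156;

# Escaping the code point itself gives a valid numeric reference.
correct = "&#%d;" % ord(quote)
print(correct)  # &#8220;
```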

You can use the HTML parser to unescape this, but you'll need to repair the resulting Mojibake:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> x = '&#226;&#128;&#156;'
>>> h.unescape(x)
u'\xe2\x80\x9c'
>>> h.unescape(x).encode('latin1')
'\xe2\x80\x9c'
>>> h.unescape(x).encode('latin1').decode('utf8')
u'\u201c'
>>> print h.unescape(x).encode('latin1').decode('utf8')
“

If printing still gives you a UnicodeEncodeError, then your terminal or console is incorrectly configured and Python is inadvertently encoding to ASCII.
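One way to check whether the console is the culprit is to look at the encoding Python has picked for standard output. The sketch below assumes Python 3; `can_encode` is a hypothetical helper, not part of any library. If the reported encoding is ASCII (or missing), setting the `PYTHONIOENCODING` environment variable to `utf-8` before starting Python is one possible workaround.

```python
import sys

# The encoding Python will use when printing text to the console.
# This may be None when stdout is redirected to an in-memory stream.
print("stdout encoding:", sys.stdout.encoding)

def can_encode(text, encoding):
    """Hypothetical helper: report whether text survives the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

quote = "\u201c"
print(can_encode(quote, "ascii"))  # False: this is the error from the question
print(can_encode(quote, "utf-8"))  # True
```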


5 Comments

thank you this was what i was looking for. i am parsing through some web crawler stuff that is going through some messed up pages. the last line worked in terminal though not in sublime text so you're right i need to configure that
Yes, the SublimeText console is not communicating the codec it uses, IIRC.
ahh much better than my method of getting it out of a unicode string +1 nice work
That's not a valid use of &# entities. Those are supposed to be unicode codepoints, so the correct representation would have been &#x201C;. (&ldquo; is also ok in HTML, but you can always use the &#x form even if you're not generating HTML.)
@rici: yup, you can even just include the literal character and encode the document properly to, say, UTF-8. I just picked &ldquo; as it is actually defined.
0

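the problem is that unescape gives you a unicode string whose code points are really UTF-8 byte values, so you cannot decode it directly ... you need to convert it back to a plain byte string and then decode that as utf8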

import HTMLParser

x = "&#226;&#128;&#156;words words words"
h = HTMLParser.HTMLParser()
msg = h.unescape(x)  # this converts it to a unicode string
downcast = "".join(chr(ord(c) & 0xff) for c in msg)  # map each code point to one byte of a normal string (Python 2)
print downcast.decode("utf8")

there may be a better way to do this in the HTMLParser library ...
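For readers on Python 3, where `HTMLParser.unescape` is no longer available, a rough equivalent can be sketched with the standard library's `html.unescape`. One caveat: `html.unescape` follows HTML5 rules and maps numeric references in the 128 to 159 range to their Windows-1252 characters (so `&#128;` becomes the euro sign), which means a plain latin-1 round trip may fail. The sketch below tries cp1252 first and falls back to latin-1 for each character. `repair_mojibake` is a hypothetical name, and the approach assumes the whole string was escaped byte by byte.

```python
import html

def repair_mojibake(text):
    """Hypothetical helper: rebuild the original bytes, then decode as UTF-8.

    Raises UnicodeEncodeError or UnicodeDecodeError if the text was not
    produced by escaping UTF-8 bytes one reference per byte.
    """
    raw = bytearray()
    for ch in text:
        try:
            # Covers characters that html.unescape remapped via Windows-1252.
            raw += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Covers the few byte values that cp1252 leaves undefined.
            raw += ch.encode("latin-1")
    return raw.decode("utf-8")

x = "&#226;&#128;&#156;words words words"
fixed = repair_mojibake(html.unescape(x))

# ascii() keeps the output printable even on a misconfigured console.
print(ascii(fixed))  # '\u201cwords words words'
```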

1 Comment

Because it is a UTF-8 encoded U+201C LEFT DOUBLE QUOTATION MARK codepoint. It is a Mojibake.
