Python Unencode unicode html hexadecimal

Question

Suppose I have strings with lots of stuff like

&#x00e2;&#x0080;&#x009c;words words words

Is there a way to convert these through python directly into the characters they represent?

I tried

h = HTMLParser.HTMLParser()
print h.unescape(x)

but got this error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

I also tried

print h.unescape(x).encode('utf-8')

but it encodes

&#x00e2;&#x0080;&#x009c; as â

when it should be a quote

what makes you think â should be a comma? what webpage is this coming from? to convert them to the characters they represent h.unescape(x) does that ... but when you try to print it there are problems ... try looking at its repr
– Joran Beasley, Commented Jun 24, 2014 at 20:19
i said quote not comma. from the context it is clear it is a quote because these appear at the beginning and end of a string that should have quotes. also this page shows this in the "As a string of HTML entities:" part software.hixie.ch/utilities/cgi/unicode-decoder/…
– user3752900, Commented Jun 24, 2014 at 20:26
Are you generating the NCR escapes, or are they coming from an external source? If from an external source, are they present in that source? As others have indicated, you have a quotation mark where the bytes are being escaped rather than the character being escaped. There is a mis-encoding in the pipeline somewhere. The first step is to identify where in the pipeline this error is occurring.
– Andj, Commented Aug 4, 2023 at 22:48

Martijn Pieters · Accepted Answer · 2014-06-24 20:30:06Z

2

&#x00e2;&#x0080;&#x009c; form a UTF-8 byte sequence for the U+201C LEFT DOUBLE QUOTATION MARK character. Something is majorly mucked up there. The correct encoding would have been &ldquo;.
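
To illustrate how a string like this can arise, here is a minimal Python 3 sketch (the thread itself uses Python 2). It assumes the producing side escaped each UTF-8 byte of the character instead of the character's code point. Both helper names are hypothetical and exist only for this illustration.

```python
def escape_utf8_bytes(ch):
    """Wrong: emit one hexadecimal reference per UTF-8 byte of the character."""
    return ''.join('&#x%04x;' % b for b in ch.encode('utf-8'))


def escape_code_point(ch):
    """Right: emit a single hexadecimal reference for the code point itself."""
    return '&#x%04x;' % ord(ch)


quote = '\u201c'  # U+201C LEFT DOUBLE QUOTATION MARK
print(escape_utf8_bytes(quote))   # &#x00e2;&#x0080;&#x009c;
print(escape_code_point(quote))   # &#x201c;
```

The first output matches the broken input from the question, which is why the repair has to go back through bytes before decoding as UTF-8.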

You can use the HTML parser to unescape this, but you'll need to repair the resulting Mojibake:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> x = '&#x00e2;&#x0080;&#x009c;'
>>> h.unescape(x)
u'\xe2\x80\x9c'
>>> h.unescape(x).encode('latin1')
'\xe2\x80\x9c'
>>> h.unescape(x).encode('latin1').decode('utf8')
u'\u201c'
>>> print h.unescape(x).encode('latin1').decode('utf8')
“
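
For readers on Python 3, the following is a rough equivalent, offered as a sketch rather than a drop-in replacement. It assumes `html.unescape` from the standard library. That function follows HTML5 rules, so references in the 0x80 to 0x9F range usually come back as Windows-1252 characters rather than C1 control characters, and the round trip therefore goes through `cp1252` instead of `latin1`. The function name `unescape_and_repair` is hypothetical.

```python
import html


def unescape_and_repair(text):
    """Unescape character references, then try to undo byte-level mojibake."""
    unescaped = html.unescape(text)
    try:
        # Map the Windows-1252 characters back to raw bytes, then decode
        # those bytes as the UTF-8 sequence they were meant to be.
        return unescaped.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not a mis-encoded byte sequence: keep the plain unescaped text.
        return unescaped


# ascii() keeps the output printable even on a misconfigured console.
print(ascii(unescape_and_repair('&#x00e2;&#x0080;&#x009c;words words words')))
```

The repair here is all-or-nothing for the whole string: if the text mixes correctly escaped characters with mis-encoded ones, the round trip fails and the plain unescaped text is returned. A few byte values have no Windows-1252 mapping, so some inputs may not survive this route.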

If printing still gives you a UnicodeEncodeError, then your terminal or console is incorrectly configured and Python is inadvertently encoding to ASCII.
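
One way to check what Python believes the console supports is to inspect `sys.stdout.encoding`. The sketch below is written for Python 3 and is only illustrative; `encode_for_stream` is a hypothetical helper that escapes unencodable characters instead of raising.

```python
import sys


def encode_for_stream(text, encoding=None):
    """Encode text for a stream, escaping what the codec cannot represent."""
    codec = encoding or sys.stdout.encoding or 'ascii'
    return text.encode(codec, errors='backslashreplace')


# The codec Python detected for standard output (may be None when undetected).
print(sys.stdout.encoding)

# With an ASCII-only target the quote is escaped rather than raising an error.
print(encode_for_stream('\u201cwords', 'ascii'))
```

If the reported codec is ASCII or missing, setting the `PYTHONIOENCODING` environment variable (for example to `utf-8`) before starting Python is one common way to override it.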

5 Comments

user3752900 Over a year ago

thank you this was what i was looking for. i am parsing through some web crawler stuff that is going through some messed up pages. the last line worked in terminal though not in sublime text so you're right i need to configure that

Martijn Pieters Over a year ago

Yes, the SublimeText console is not communicating the codec it uses, IIRC.

Joran Beasley Over a year ago

ahh much better than my method of getting it out of a unicode string +1 nice work

rici Over a year ago

That's not a valid use of &# entities. Those are supposed to be unicode codepoints, so the correct representation would have been &#x201C;. (&ldquo; is also ok in HTML, but you can always use the &#x form even if you're not generating HTML.)

Martijn Pieters Over a year ago

@rici: yup, you can even just include the literal character and encode the document properly to, say, UTF-8. I just picked &ldquo; as it is actually defined.

Joran Beasley · Answer · 2014-06-24 20:40:56Z

0

the problem is that you cannot decode a unicode string directly ... you first need to convert it from unicode back to a plain byte string holding the utf8 bytes

x="&#x00e2;&#x0080;&#x009c;words words words"
h = HTMLParser.HTMLParser()
msg=h.unescape(x) #this converts it to unicode string ..
downcast = "".join(chr(ord(c)&0xff) for c in msg) #convert it to normal string (python2)
print downcast.decode("utf8")

there may be a better way to do this in the HTMLParser library ...
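
One possible alternative, sketched here in Python 3, skips the intermediate unicode string and turns runs of byte-valued references straight into bytes. This is not something the `HTMLParser` library provides; `repair_byte_references` and the two regular expressions are hypothetical names for this illustration. It assumes the broken references always use the hexadecimal `&#x..;` form seen in the question.

```python
import re

# One hexadecimal character reference, capturing its digits.
_REF = re.compile(r'&#[xX]([0-9a-fA-F]{1,6});')
# A run of one or more consecutive hexadecimal references.
_REF_RUN = re.compile(r'(?:&#[xX][0-9a-fA-F]{1,6};)+')


def repair_byte_references(text):
    """Replace runs of references that spell UTF-8 bytes with the real text."""
    def fix_run(match):
        values = [int(digits, 16) for digits in _REF.findall(match.group(0))]
        if all(value <= 0xFF for value in values):
            try:
                return bytes(values).decode('utf-8')
            except UnicodeDecodeError:
                pass
        # Not a valid UTF-8 byte run: leave the references untouched.
        return match.group(0)

    return _REF_RUN.sub(fix_run, text)


# ascii() keeps the output printable even on a misconfigured console.
print(ascii(repair_byte_references('&#x00e2;&#x0080;&#x009c;words words words')))
```

Runs that do not decode as UTF-8, such as a correctly escaped `&#x201c;`, are left as they are, so the result can still be passed to an ordinary unescaper afterwards. A run that mixes correct references with byte-valued ones is also left untouched, which is a limitation of this sketch.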

edited Jun 24, 2014 at 20:40

answered Jun 24, 2014 at 20:28

1 Comment

Martijn Pieters Over a year ago

Because it is a UTF-8 encoded U+201C LEFT DOUBLE QUOTATION MARK codepoint. It is a Mojibake.
