HTMLParser.HTMLParser().unescape() doesn't work

Question

I would like to convert HTML entities back to their human-readable form, e.g. '&pound;' to '£', '&deg;' to '°', etc.

I've read several posts regarding this question

Converting html source content into readable format with Python 2.x

Convert XML/HTML Entities into Unicode String in Python

and according to them, I chose to use the undocumented function unescape(), but it doesn't work for me...

My code sample is like:

import HTMLParser

htmlParser = HTMLParser.HTMLParser()
decoded = htmlParser.unescape('&copy; 2013')
print decoded

When I run this Python script, the output is still:

&copy; 2013

instead of

© 2013

I'm using Python 2.X on Windows 7 with the Cygwin console. I googled and didn't find any similar problems. Could anyone help me with this?

I have tried calling it from the command line and from IDLE, and it does work for me (Python 2.7 on Windows 8).
– A. Rodas, Commented Jul 19, 2013 at 16:55

DrMeers · Accepted Answer · 2022-04-29 01:28:20Z

8

Apparently HTMLParser.unescape was a bit more primitive before Python 2.6.

Python 2.5:

>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('&copy;')
'&copy;'

Python 2.6/2.7:

>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('&copy;')
u'\xa9'

UPDATE: Python 3.4+:

>>> import html
>>> html.unescape('&copy;')
'©'

See the 2.5 implementation vs the 2.6 implementation / 2.7 implementation
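
For code that has to run on several of the versions above, one option is to pick the right unescape at import time. This is only a minimal sketch: the try/except fallback is an assumption about how you might want to structure it, and it does not help on Python 2.5, where unescape leaves '&copy;' untouched as shown above.

```python
try:
    # Python 3.4+: unescape is a module-level function in html.
    from html import unescape
except ImportError:
    # Python 2.6/2.7 fallback: unescape is an undocumented HTMLParser method.
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

decoded = unescape('&copy; 2013')
print(decoded)
```

On Python 2 the result is a unicode object, so printing it depends on the console encoding.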

edited Apr 29, 2022 at 1:28

answered Apr 4, 2014 at 10:26

DrMeers



2 Comments

jfs Over a year ago

it is html.unescape() in Python 3.4+

charlesreid1 Over a year ago

*in Python 3.4-3.8

andorov · Accepted Answer · 2022-01-20 21:37:42Z

7

Starting in Python 3.9, calling HTMLParser().unescape(<str>) raises AttributeError: 'HTMLParser' object has no attribute 'unescape'

You can update it to:

import html
html.unescape(<str>)
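
As a quick illustration (the sample strings are made up for this example), html.unescape handles named, decimal, and hexadecimal character references alike:

```python
import html

# Hypothetical inputs: named, decimal, and hex forms of the same character,
# plus a string mixing two named entities.
samples = ['&copy; 2013', '&#169; 2013', '&#xA9; 2013', '&pound;5 at 20&deg;C']
decoded = [html.unescape(s) for s in samples]

for original, result in zip(samples, decoded):
    print(original, '->', result)
```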

answered Jan 20, 2022 at 21:37

andorov


1 Comment

Francois Over a year ago

Exactly what I was looking for in Python 3.9

Aleksi · Accepted Answer · 2013-07-19 19:28:57Z

1

This site lists some solutions, here's one of them:

from xml.sax.saxutils import escape, unescape

html_escape_table = {
    '"': "&quot;",
    "'": "&apos;",
    "©": "&copy;"
    # etc...
}
html_unescape_table = {v:k for k, v in html_escape_table.items()}

def html_unescape(text):
    return unescape(text, html_unescape_table)

Not the prettiest thing though, since you would have to list each escaped symbol manually.
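
One way to avoid typing the table by hand is to build it from the standard library's entity list. This is a sketch, not part of the linked solution: it assumes Python 3 (in Python 2 the module is htmlentitydefs and you would use unichr), it only covers the named entities in name2codepoint, and it does not decode numeric references such as '&#169;'.

```python
from xml.sax.saxutils import unescape
from html.entities import name2codepoint  # Python 3 module name

# Map '&name;' to its character for every named entity the stdlib knows.
# 'amp' is left out on purpose: saxutils.unescape replaces '&amp;' itself,
# as its last step, which avoids decoding '&amp;copy;' twice.
html_unescape_table = {
    "&%s;" % name: chr(codepoint)
    for name, codepoint in name2codepoint.items()
    if name != "amp"
}

def html_unescape(text):
    return unescape(text, html_unescape_table)

print(html_unescape('&copy; 2013 &amp; &pound;5'))
```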

EDIT:

How about this?

import htmllib

def unescape(s):
    p = htmllib.HTMLParser(None)
    p.save_bgn()
    p.feed(s)
    return p.save_end()

edited Jul 19, 2013 at 19:28

answered Jul 19, 2013 at 17:15

Aleksi


1 Comment

D.Q. Over a year ago

Hi, thank you for your answer. But the content of my HTML page is unknown, so unless I listed all the HTML special characters...
