I would like to convert HTML entities back to their human-readable form, e.g. '&pound;' to '£', '&deg;' to '°', etc.

I've read several posts on this question:

Converting html source content into readable format with Python 2.x

Decode HTML entities in Python string?

Convert XML/HTML Entities into Unicode String in Python

Based on them, I chose to use the undocumented method HTMLParser.unescape(), but it doesn't work for me.

My code sample:

import HTMLParser

htmlParser = HTMLParser.HTMLParser()
decoded = htmlParser.unescape('&copy; 2013')
print decoded

When I run this Python script, the output is still:

&copy; 2013

instead of

© 2013

I'm using Python 2.x on Windows 7 in a Cygwin console. I googled and didn't find any similar problems. Could anyone help me with this?

  • I have tried calling it from the command line and the IDLE, and it does work for me (Python 2.7 on Windows 8). Commented Jul 19, 2013 at 16:55

3 Answers

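Apparently HTMLParser.unescape was more primitive before Python 2.6: the 2.5 version only replaces a handful of basic entities (&lt;, &gt;, &apos;, &quot;, &amp;), so &copy; passes through unchanged.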

Python 2.5:

>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('&copy;')
'&copy;'

Python 2.6/2.7:

>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('&copy;')
u'\xa9'

UPDATE: Python 3.4+:

>>> import html
>>> html.unescape('&copy;')
'©'

See the 2.5 implementation vs the 2.6 implementation / 2.7 implementation
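
If the same script has to run on several of the versions above, one possible approach is to pick the decoder at import time. This is only a sketch: it assumes Python 2.6 or later (on 2.5 the fallback would still return named entities such as &copy; unchanged), and the module-level name `unescape` is chosen here for illustration.

```python
# Pick whichever entity decoder the running interpreter provides.
try:
    from html import unescape  # Python 3.4+
except ImportError:
    try:
        from html.parser import HTMLParser  # Python 3.0-3.3
    except ImportError:
        from HTMLParser import HTMLParser  # Python 2.6/2.7
    # Undocumented method; removed in Python 3.9, where the first import succeeds.
    unescape = HTMLParser().unescape

decoded = unescape('&copy; 2013')
# decoded == u'\xa9 2013', i.e. the copyright sign followed by " 2013"
```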

2 Comments

it is html.unescape() in Python 3.4+
*in Python 3.4-3.8

Starting in Python 3.9, calling HTMLParser().unescape(<str>) results in the error AttributeError: 'HTMLParser' object has no attribute 'unescape'

You can update it to:

import html
html.unescape(<str>)
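
As a concrete illustration, here is a minimal sketch for Python 3.4+ showing that html.unescape handles named, decimal, and hexadecimal references alike. The wrapper name `decode_entities` is made up for this example; calling html.unescape directly works just as well.

```python
import html

def decode_entities(text):
    """Decode named and numeric HTML character references in text."""
    return html.unescape(text)

# The sample from the question, plus decimal and hex forms of the same character.
copyright_line = decode_entities('&copy; 2013')   # '© 2013'
from_decimal = decode_entities('&#169;')          # '©'
from_hex = decode_entities('&#xA9;')              # '©'
temperature = decode_entities('25&deg;C')         # '25°C'
```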

1 Comment

Exactly what I was looking for in Python 3.9

This site lists some solutions; here's one of them:

from xml.sax.saxutils import escape, unescape

html_escape_table = {
    '"': "&quot;",
    "'": "&apos;",
    "©": "&copy;"
    # etc...
}
html_unescape_table = {v:k for k, v in html_escape_table.items()}

def html_unescape(text):
    return unescape(text, html_unescape_table)

Not the prettiest thing though, since you would have to list each escaped symbol manually.
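
One way to avoid the manual listing, sketched here for Python 3, is to build the table from the standard library's name2codepoint mapping (in Python 2 that mapping lives in htmlentitydefs and unichr would replace chr). This is an assumption-laden variant of the snippet above rather than a full solution: it does not decode numeric references such as &#169;.

```python
from html.entities import name2codepoint
from xml.sax.saxutils import unescape

# Map "&name;" -> character for every named entity in name2codepoint.
# "&amp;", "&lt;" and "&gt;" are skipped because saxutils.unescape() handles
# them itself and replaces "&amp;" last, which avoids double unescaping.
html_unescape_table = {
    "&%s;" % name: chr(codepoint)
    for name, codepoint in name2codepoint.items()
    if name not in ("amp", "lt", "gt")
}
html_unescape_table["&apos;"] = "'"  # not included in name2codepoint

def html_unescape(text):
    return unescape(text, html_unescape_table)
```

With this table, html_unescape('&copy; 2013') gives '© 2013', while an escaped ampersand such as '&amp;copy;' comes back as the literal text '&copy;'.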

EDIT:

How about this? (Note that htmllib only exists in Python 2.)

import htmllib

def unescape(s):
    p = htmllib.HTMLParser(None)
    p.save_bgn()
    p.feed(s)
    return p.save_end()

1 Comment

Hi, thank you for your answer. But the content of my HTML page is unknown, so this won't work unless I list all the HTML special characters...
