Python convert html ascii encoded text to utf8

Question

I have an XML file that I need to convert to UTF-8. Unfortunately, the text contains HTML entities like this:

&#047;mytext&#044;

I'm using the codecs library to convert files to UTF-8, but it doesn't handle HTML entities.

Is there an easy way to get rid of the HTML encoding?

Thanks

stackoverflow.com/questions/37486/…

– kechap, Commented Feb 28, 2012 at 17:56

Can you just pass the raw file through an unescape first?

– jterrace, Commented Feb 28, 2012 at 17:56

jterrace · Accepted Answer · 2012-02-28 17:57:49Z

3

You can pass the text of the file through an unescape function before passing it to the XML parser.
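A minimal sketch of that first approach, assuming Python 3.4 or later, where the standard library provides `html.unescape`. The helper name `convert_file`, its parameters, and the file paths are hypothetical. Note that unescaping a whole document also decodes references such as `&lt;` and `&amp;`, so this is only safe when the markup does not depend on them staying escaped.

```python
import html

def convert_file(src_path, dst_path, src_encoding="ascii"):
    """Hypothetical helper: unescape character references and save as UTF-8."""
    with open(src_path, encoding=src_encoding) as src:
        text = src.read()
    # html.unescape turns references such as &#047; into the characters they name.
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(html.unescape(text))

# The sample string from the question.
sample = html.unescape("&#047;mytext&#044;")
print(sample)  # /mytext,
```

The result can then be handed to the XML parser, or written out directly if only the re-encoding is needed.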

Alternatively, if you're only parsing HTML, lxml's HTML parser does this for you:

>>> import lxml.html
>>> html = lxml.html.fromstring("<html><body><p>&#047;mytext&#044;</p></body></html>")
>>> lxml.html.tostring(html)
'<html><body><p>/mytext,</p></body></html>'
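Since the question is about XML rather than HTML, it may be worth checking whether the XML parser already handles this: numeric character references like `&#047;` are part of XML itself, so a conforming parser resolves them while parsing. The sketch below uses the standard library's `xml.etree.ElementTree` instead of lxml, and the `<root>` wrapper element is invented for illustration. It would not help with HTML-only named entities such as `&nbsp;`, which an XML parser rejects unless they are declared.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document wrapping the text from the question.
xml_text = "<root><p>&#047;mytext&#044;</p></root>"

root = ET.fromstring(xml_text)

# The parser has already resolved the numeric character references.
decoded = root.find("p").text
print(decoded)

# Serialize the tree back out as UTF-8 encoded bytes.
utf8_bytes = ET.tostring(root, encoding="utf-8")
print(utf8_bytes)
```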


kindall · Accepted Answer · 2012-02-28 18:52:33Z

1

I recently posted the following in response to a similar question:

import HTMLParser     # Python 2; the module is html.parser in Python 3, where html.unescape() replaces this method
h = HTMLParser.HTMLParser()
h.unescape('&#047;mytext&#044;')

Technically this method is "internal" and undocumented, but it's been in the API quite a while and isn't marked with a leading underscore.

Found it here; other approaches are also mentioned, of which BeautifulSoup is probably the best if you don't mind its "heaviness."
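For code that has to run on newer interpreters, here is a hedged sketch of a version-tolerant import. It assumes that `html.unescape` is available (Python 3.4+) and falls back to the undocumented Python 2 method shown above otherwise. The `HTMLParser.unescape` method was removed in Python 3.9, so the documented function is the safer choice where it exists.

```python
try:
    # Python 3.4+: documented, module-level function.
    from html import unescape
except ImportError:
    # Python 2 fallback: the undocumented method from the answer above.
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

result = unescape("&#047;mytext&#044;")
print(result)  # /mytext,
```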
