3

I have an XML file that I need to convert to UTF-8. Unfortunately, the text contains HTML character references like this:

&#047;mytext&#044;

I'm using the codecs module to convert files to UTF-8, but it doesn't decode HTML entities.

Is there an easy way to get rid of the HTML encoding?

Thanks

2
  • stackoverflow.com/questions/37486/… Commented Feb 28, 2012 at 17:56
  • Can you just pass the raw file through an unescape first? Commented Feb 28, 2012 at 17:56

2 Answers

3

You can pass the text of the file through an unescape function before passing it to the XML parser.
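
A minimal sketch of that pre-processing step, assuming Python 3.4+ where `html.unescape` is available. The helper name `unescape_for_xml` is hypothetical. It leaves references such as `&amp;` and `&lt;` untouched, on the assumption that decoding them before parsing would make the XML ill-formed:

```python
import html
import re

# Characters that must stay escaped for the XML to remain well-formed.
_XML_SPECIAL = {"<", ">", "&", '"', "'"}

# Matches decimal, hexadecimal, and named character references.
_ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def unescape_for_xml(text):
    """Decode character references, keeping XML-significant ones escaped."""
    def replace(match):
        decoded = html.unescape(match.group(0))
        # Keep the original reference if it decodes to a markup character.
        if decoded in _XML_SPECIAL:
            return match.group(0)
        return decoded
    return _ENTITY.sub(replace, text)


print(unescape_for_xml("<p>&#047;mytext&#044; &amp; more</p>"))
```

The result can then be handed to the XML parser as usual.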

Alternatively, if you're only parsing HTML, lxml's HTML parser does this for you:

>>> import lxml.html
>>> html = lxml.html.fromstring("<html><body><p>&#047;mytext&#044;</p></body></html>")
>>> lxml.html.tostring(html)
'<html><body><p>/mytext,</p></body></html>'

1

Recently posted the below in response to a similar question:

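import HTMLParser     # Python 2; the module is html.parser in Python 3
h = HTMLParser.HTMLParser()
h.unescape('&#047;mytext&#044;')  # on Python 3.4+ use html.unescape(); HTMLParser.unescape was removed in 3.9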

Technically this method is "internal" and undocumented, but it's been in the API quite a while and isn't marked with a leading underscore.
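
On Python 3.4 and later the same functionality is available as the documented `html.unescape()`. Here is a rough sketch of the whole conversion, assuming the input is plain text in a known encoding. The function name `convert_to_utf8`, the file names, and the `latin-1` default are all placeholders for illustration:

```python
import html
import io
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="latin-1"):
    """Read src_path, decode HTML character references, write UTF-8."""
    # src_encoding is an assumption; pass the file's real encoding.
    with io.open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    with io.open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(html.unescape(text))


# Small demonstration using a temporary directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.txt")
dst = os.path.join(workdir, "out.txt")
with io.open(src, "w", encoding="latin-1") as f:
    f.write(u"&#047;mytext&#044; caf\xe9 &eacute;")

convert_to_utf8(src, dst)
with io.open(dst, "r", encoding="utf-8") as f:
    result = f.read()
print(result)
```

Note that `html.unescape()` also decodes `&amp;` and `&lt;`, so for XML input it is safer to apply it to the parsed text content rather than to the raw markup.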

Found it here; other approaches are also mentioned, of which BeautifulSoup is probably the best if you don't mind its "heaviness."
