I'm trying to use Python (with pyquery and lxml) to alter and clean up some HTML.
E.g. html = "<div><!-- word style><bleep><omgz 1,000 tags><--><p>It’s a spicy meatball!</div>"
The clean_html() function from lxml.html.clean works well, except that it replaces the nice HTML entities like
’
with some unicode string
\xc2\x92
These characters look strange in different browsers (Firefox and Opera, using auto-detected encoding, UTF-8, Latin-1, etc.), rendering as an empty box. How can I stop lxml from converting the entities? How can I get it all in Latin-1 encoding? It seems strange that a module built specifically for HTML would do this.
I can't be sure which characters are there, so I can't just use
replace("\xc2\x92","’").
I've tried using
clean_html(html).encode('latin-1')
but the unicode persists.
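One workaround I'm considering is encoding to ASCII with the xmlcharrefreplace error handler, which turns every non-ASCII character back into a numeric character reference. A sketch (the literal below just stands in for the clean_html() output, since I haven't confirmed what it returns for my input):

```python
# Stand-in for the unicode string that clean_html(html) hands back.
cleaned = u"It\u2019s a spicy meatball!"

# xmlcharrefreplace escapes anything non-ASCII as &#NNNN;
escaped = cleaned.encode('ascii', 'xmlcharrefreplace')
# escaped is now "It&#8217;s a spicy meatball!" as a byte string
```

Numeric references like &#8217; are pure ASCII, so they survive any declared page encoding, Latin-1 included.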
And yes, I'd tell people to stop using Word to write HTML, but then I'd hear the whole
"iz th wayz i liks it u cant mak me chang hitlr".
Edit: a BeautifulSoup solution:
from BeautifulSoup import BeautifulSoup, Comment

soup = BeautifulSoup(str(desc[desc_type]))
# find every comment node (Word's junk included) and pull it out of the tree
for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
    comment.extract()
print soup
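If pulling in BeautifulSoup is overkill, the same comment stripping can be sketched with the stdlib parser (this is Python 3's html.parser; on Python 2 the module is HTMLParser, and the convert_charrefs flag doesn't exist there). The CommentStripper class and its details are mine, not from any library:

```python
from html.parser import HTMLParser

class CommentStripper(HTMLParser):
    """Hypothetical helper: re-emits everything it sees except comments."""
    def __init__(self):
        HTMLParser.__init__(self, convert_charrefs=False)  # keep &rsquo; etc. intact
        self.out = []
    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
    def handle_endtag(self, tag):
        self.out.append('</%s>' % tag)
    def handle_data(self, data):
        self.out.append(data)
    def handle_entityref(self, name):
        self.out.append('&%s;' % name)
    def handle_charref(self, name):
        self.out.append('&#%s;' % name)
    # no handle_comment override, so comments are silently dropped

stripper = CommentStripper()
stripper.feed('<p>It&rsquo;s a spicy meatball!<!-- word junk --></p>')
cleaned = ''.join(stripper.out)
```

Because convert_charrefs is off, &rsquo; passes through untouched instead of being decoded to a raw character, which is exactly the behaviour I wanted from lxml in the first place.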