I'm trying to use Python (with pyquery and lxml) to alter and clean up some HTML.
E.g. html = "<div><!-- word style><bleep><omgz 1,000 tags><--><p>It’s a spicy meatball!</div>"
The clean_html() function from lxml.html.clean works well, except that it replaces the nice HTML entities like
’
with some unicode string
\xc2\x92
These characters look strange in different browsers (Firefox and Opera, using auto-detected encoding, UTF-8, Latin-1, etc.), rendering as an empty box. How can I stop lxml from converting the entities? How can I get it all in Latin-1 encoding? It seems strange that a module built specifically for HTML would do this.
I can't be sure which characters are there, so I can't just use
replace("\xc2\x92","’").
I've tried using
clean_html(html).encode('latin-1')
but the unicode persists.
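One workaround I'm considering is encoding to ASCII with the xmlcharrefreplace error handler, which turns every non-ASCII character back into a numeric character reference. A sketch (the literal below just stands in for the clean_html() output, since I haven't confirmed what it returns for my input):

```python
# Stand-in for the unicode string that clean_html(html) hands back.
cleaned = u"It\u2019s a spicy meatball!"

# xmlcharrefreplace escapes anything non-ASCII as &#NNNN;
escaped = cleaned.encode('ascii', 'xmlcharrefreplace')
# escaped is now "It&#8217;s a spicy meatball!" as a byte string
```

Numeric references like &#8217; are pure ASCII, so they survive any declared page encoding, Latin-1 included.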
And yes, I'd tell people to stop using Word to write HTML, but then I'd hear the whole
"iz th wayz i liks it u cant mak me chang hitlr".
Edit: a BeautifulSoup solution:
from BeautifulSoup import BeautifulSoup, Comment

soup = BeautifulSoup(str(desc[desc_type]))
# find every comment node (Word's junk included) and pull it out of the tree
for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
    comment.extract()
print soup
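If pulling in BeautifulSoup is overkill, the same comment stripping can be sketched with the stdlib parser (this is Python 3's html.parser; on Python 2 the module is HTMLParser, and the convert_charrefs flag doesn't exist there). The CommentStripper class and its details are mine, not from any library:

```python
from html.parser import HTMLParser

class CommentStripper(HTMLParser):
    """Hypothetical helper: re-emits everything it sees except comments."""
    def __init__(self):
        HTMLParser.__init__(self, convert_charrefs=False)  # keep &rsquo; etc. intact
        self.out = []
    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
    def handle_endtag(self, tag):
        self.out.append('</%s>' % tag)
    def handle_data(self, data):
        self.out.append(data)
    def handle_entityref(self, name):
        self.out.append('&%s;' % name)
    def handle_charref(self, name):
        self.out.append('&#%s;' % name)
    # no handle_comment override, so comments are silently dropped

stripper = CommentStripper()
stripper.feed('<p>It&rsquo;s a spicy meatball!<!-- word junk --></p>')
cleaned = ''.join(stripper.out)
```

Because convert_charrefs is off, &rsquo; passes through untouched instead of being decoded to a raw character, which is exactly the behaviour I wanted from lxml in the first place.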