I'm using lxml to extract text from HTML documents, but some characters in the text don't render properly. It's probably something simple, but I can't figure out a solution.
Here's a simplified version of the HTML:
<html>
<head>
<meta content="text/html" charset="UTF-8"/>
</head>
<body>
<p>DAÑA – bis’e</p> <!---that's an en dash and the single quote is curly--->
</body>
</html>
A simplified version of the code:
import lxml.html as LH
htmlfile = "path/to/file"
tree = LH.parse(htmlfile)
root = tree.getroot()
for para in root.iter("p"):
    print(para.text)
The output in my terminal shows those little boxes with a character error (for example, one appears where "– E" should be), but if I copy and paste it from there to here, it looks like:
>>> DAÃO bisâe
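For comparison, here is a minimal sketch of one way output like this can arise, assuming the file's UTF-8 bytes are being decoded as Latin-1 somewhere along the way. That assumption is not confirmed for this file, and the variable names are only illustrative. The result is close to, but not identical to, the pasted output above.

```python
# Hypothetical reproduction: UTF-8 bytes decoded with the wrong codec.
original = "DAÑA – bis’e"  # Ñ, en dash, curly apostrophe

# The bytes as they would be stored in a UTF-8 file.
raw = original.encode("utf-8")

# Decoding those bytes as Latin-1 turns each multi-byte character
# into two or three single-byte characters (mojibake).
garbled = raw.decode("latin-1")

# Bytes 0x80-0x9F become invisible control characters in Latin-1,
# which terminals often draw as boxes and copy-paste may drop.
visible = "".join(ch for ch in garbled if ch.isprintable())
print(visible)

# The damage is reversible as long as no characters were lost.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

If the round trip restores the text, the data itself is intact and only the decoding step is wrong.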
If I simply echo the problem characters in the terminal, they render properly, so I don't think the terminal is the problem.
The HTML encoding is UTF-8 (checked with docinfo). I've tried .encode() and .decode() in various places in the code. I also tried lxml.etree.tostring() with a UTF-8 encoding declaration, but it returns bytes: if I call it early, .iter() no longer works ('bytes' object has no attribute 'iter'), and if I call it on the end nodes, .text doesn't work ('bytes' object has no attribute 'text').
Any ideas what's going wrong and/or how to solve it?
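One thing that might be worth trying is passing the encoding to the parser explicitly rather than encoding or decoding afterwards. This is only a sketch, under the assumption that the parser is not picking up the charset from the meta tag. The helper name paragraph_texts and the in-memory sample are hypothetical stand-ins for the real file.

```python
import io

import lxml.html as LH


def paragraph_texts(source):
    """Return the text of every <p>, telling the parser the input is UTF-8.

    `source` may be a file path or a binary file-like object.
    """
    # Override encoding detection instead of relying on the <meta> tag.
    parser = LH.HTMLParser(encoding="utf-8")
    tree = LH.parse(source, parser=parser)
    root = tree.getroot()
    return [para.text for para in root.iter("p")]


# Hypothetical in-memory stand-in for "path/to/file".
sample = (
    '<html><head><meta content="text/html" charset="UTF-8"/></head>'
    "<body><p>DAÑA – bis’e</p></body></html>"
).encode("utf-8")

texts = paragraph_texts(io.BytesIO(sample))
print(texts)
```

With the real file, the call would be paragraph_texts(htmlfile). The elements stay lxml objects here, so .iter() and .text keep working, unlike with the bytes returned by tostring().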
