I'm using lxml to extract text from html docs and I cannot get some characters from the text to render properly. It's probably a stupid thing, but I can't seem to figure out a solution...

Here's a simplified version of the html:

<html>
    <head>
        <meta content="text/html" charset="UTF-8"/>
    </head>
    <body>
        <p>DAÑA – bis'e</p> <!---that's an N dash and the single quote is curly--->
    </body>
</html>

A simplified version of the code:

import lxml.html as LH
htmlfile = "path/to/file"
tree = LH.parse(htmlfile)
root = tree.getroot()
for para in root.iter("p"):
    print(para.text)

The output in my terminal has those little boxes with a character error (for example, in a spot that should read "– E"), but if I copy-paste from there to here, it looks like:

>>> DAÃO bisâe

If I do a simple echo + problem characters in the terminal they render properly, so I don't think that's the problem.

The HTML encoding is UTF-8 (checked with docinfo). I've tried .encode() and .decode() in various places in the code. I also tried lxml.etree.tostring() with a UTF-8 declaration, but then .iter() doesn't work ('bytes' object has no attribute 'iter'); if I apply it further down, towards the end nodes, .text doesn't work ('bytes' object has no attribute 'text').

Any ideas what's going wrong and/or how to solve it?

  • Having coded a couple of basic websites myself, it seems strange that these characters, ñ – etc, are in the source html in the first place. I would expect them not to render correctly in the browser, but they do. Maybe this is related to the problem. Commented Dec 11, 2019 at 9:36

2 Answers

Open the file with the correct encoding (I'm assuming UTF-8 here; look at the HTML file to confirm).

import lxml.html as LH

with open("path/to/file", encoding="utf8") as f:
    tree = LH.parse(f)
    root = tree.getroot()
    for para in root.iter("p"):
        print(para.text)
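
One way to confirm what is actually stored is to look at the raw bytes around a suspicious spot. The helper below is only a sketch: `hex_context` and its parameters are made-up names, and the sample bytes stand in for the contents of a real file.

```python
def hex_context(data: bytes, needle: bytes, width: int = 4) -> str:
    """Return a hex dump of the bytes around the first occurrence of needle."""
    pos = data.find(needle)
    if pos == -1:
        return ""  # needle not present in data
    start = max(0, pos - width)
    end = min(len(data), pos + len(needle) + width)
    return " ".join(f"{b:02x}" for b in data[start:end])


# Stand-in for the bytes of a healthy UTF-8 file (an assumption for illustration).
sample = "<p>DAÑO</p>".encode("utf-8")
print(hex_context(sample, b"DA", width=3))  # 3c 70 3e 44 41 c3 91 4f
```

For a real file, the bytes would come from opening it in binary mode, e.g. `open("path/to/file", "rb").read()`. In healthy UTF-8 the Ñ shows up as `c3 91`; a longer sequence in its place suggests the file was already damaged before lxml ever saw it.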

Background explanation of how you arrived where you currently are.

Incoming data from the server:

Bytes (hex)            Decoded as   Result String          Comment
44 41 C3 91 4F         UTF-8        DAÑO                   proper decode
44 41 C3 91 4F         Latin-1      DAÃ▯O                  improper decode

The bytes should not have been decoded as Latin-1, that's an error.

C3 91 represents one character in UTF-8 (the Ñ) but it's two characters in Latin-1 (the Ã, and byte 91). But byte 91 has no printable character in Latin-1 (it falls in the range of control codes), so there is nothing to display. I took ▯ to make it visible. A text editor might skip it altogether, showing DAÃO instead, or a weird box, or an error marker.
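
The two decodes from the table can be reproduced in a few lines of Python. This is just an illustration using the same five bytes; the variable names are arbitrary.

```python
# The five bytes from the table above: D, A, the two-byte UTF-8 form of Ñ, O.
raw = bytes.fromhex("4441c3914f")

as_utf8 = raw.decode("utf-8")      # proper decode: four characters
as_latin1 = raw.decode("latin-1")  # improper decode: five characters

print(as_utf8)           # DAÑO
print(ascii(as_latin1))  # 'DA\xc3\x91O'  (\xc3 is Ã, \x91 is the invisible control code)
print(len(as_utf8), len(as_latin1))  # 4 5
```

Note that the Latin-1 decode never raises an error, because every byte value maps to some code point in Latin-1. That is why this kind of mistake tends to go unnoticed until the text is displayed.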

When writing the improperly decoded string to file:

String                 Encoded as   Result Bytes (hex)     Comment
DAÃ▯O                  UTF-8        44 41 C3 83 C2 91 4F   weird box preserved as C2 91

The string should not have been encoded as UTF-8 at this point, that's an error, too.

The à got converted to C3 83, which is correct for this character in UTF-8. Note how the byte sequence now matches what you told me in the comments (\xc3\x83\xc2\x91).

When reading that file:

Bytes (hex)            Decoded as   Result String          Comment
44 41 C3 83 C2 91 4F   UTF-8        DAÃ▯O                  unprintable character is retained
44 41 C3 83 C2 91 4F   Latin-1      DAÃ▯Â▯O                unprintable characters are retained

No matter how you decode that, it remains broken.
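
In Python terms, neither single decode gives back the intended text. What can work is undoing the two mistakes in reverse order, but only under the assumption that the mix-up was exactly "UTF-8 read as Latin-1, then written as UTF-8". The sketch below uses the seven bytes from the table; if a different codec was involved, the repair step may raise an error or produce different garbage.

```python
on_disk = bytes.fromhex("4441c383c2914f")

via_utf8 = on_disk.decode("utf-8")      # still garbled: Ã plus a control character
via_latin1 = on_disk.decode("latin-1")  # garbled even further

# Reverse the damage: back to the original bytes via Latin-1, then decode as UTF-8.
repaired = via_utf8.encode("latin-1").decode("utf-8")

print(ascii(via_utf8))    # 'DA\xc3\x91O'
print(ascii(via_latin1))  # 'DA\xc3\x83\xc2\x91O'
print(repaired)           # DAÑO
```

This is a workaround for files that are already damaged, not a substitute for saving them correctly.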

Your data got mangled by making two mistakes in a row: decoding it improperly, and then re-encoding it improperly again. The right thing would have been to write the bytes from the server directly to disk, without converting them to string at any point.
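
A minimal sketch of that idea follows. How the files were actually downloaded is not known here, so `save_raw` is a hypothetical helper and `payload` is a stand-in for the response body. With `urllib.request`, `urlopen(url).read()` returns such bytes; with the requests library it would be `response.content` rather than `response.text`.

```python
import tempfile
from pathlib import Path


def save_raw(payload: bytes, path: Path) -> None:
    """Write the bytes to disk unchanged; no decode or encode step is involved."""
    path.write_bytes(payload)


# Stand-in for the bytes received from the server (an assumption for illustration).
payload = "<p>DAÑO – bis’e</p>".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "page.html"
    save_raw(payload, target)
    round_trip = target.read_bytes()             # identical to what was received
    text = target.read_text(encoding="utf-8")    # decode once, only when reading

print(round_trip == payload)  # True
print(text)
```

Because the bytes are never converted to a string on the way to disk, there is no chance to pick the wrong codec while saving.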


15 Comments

No, that's not strange. HTML supports any character that the chosen file encoding can represent. ñ and &ntilde; are 100% the same thing, but the latter may be easier to enter when you don't have the right keyboard layout for typing an actual ñ, or when the chosen file encoding cannot represent the actual ñ (e.g. to get ñ into an ASCII-encoded file).
It's hard to say what encoding your file is in (it might already be broken depending on how the file was produced in the first place). Only looking at the bytes at the location in question can clarify this.
You scraped them? How? And how did you save them? What does the text editor detect as the file encoding? Also, tell me the byte sequence at the offending position.
Yes, the file is already broken. Can you share the URL you have downloaded it from and the commandline you used to save it?
Here is what I think has happened during download. Originally the data was proper UTF-8, that got interpreted as Latin-1, that got written to file as UTF-8, and thus ended up garbled. Reading the file as UTF-8 cannot fix this. Quite possibly the error is on your side due to mistakes when saving the file, which is why it's important to know how you did that. (It's possible to fix the file while reading it, but it's much more sensible not to break it in the first place.)

I've found the unidecode package to work quite well converting non-ascii characters to the closest ascii.

from unidecode import unidecode

def check_ascii(in_string):
    """Return in_string unchanged if it is pure ASCII, else its closest ASCII transliteration."""
    if in_string.isascii():  # str.isascii() is available in Python 3.7+
        return in_string
    return unidecode(in_string)  # Converts non-ASCII characters to the closest ASCII

Then if you believe some text might contain non-ascii characters you can pass it to the above function. In your case after extracting the text between the html tags you can pass it with:

for para in root.iter("p"):
    print(check_ascii(para.text))

You can find details about the package here: https://pypi.org/project/Unidecode/

3 Comments

Thanks for answering. This works only in the sense that it gets rid of those ugly boxes, but the character rendering is still wrong. Ñ becomes A, ' becomes a, ñ becomes A+-, ndash disappears completely.
You don't want your file to be converted to ASCII.
I see, sorry about that, it might be a problem of the original source then. @Tomalak the idea was not to convert the file to ASCII but the extracted text
