Python lxml & string encoding issue

Question

I'm using lxml to extract text from html docs and I cannot get some characters from the text to render properly. It's probably a stupid thing, but I can't seem to figure out a solution...

Here's a simplified version of the html:

<html>
    <head>
        <meta content="text/html" charset="UTF-8"/>
    </head>
    <body>
        <p>DAÑA – bis'e</p> <!-- that's an en dash, and the single quote is curly -->
    </body>
</html>

A simplified version of the code:

import lxml.html as LH
htmlfile = "path/to/file"
tree = LH.parse(htmlfile)
root = tree.getroot()
for para in root.iter("p"):
    print(para.text)

The output in my terminal has those little boxes with a character error (for example, [screenshot not preserved], which should be "– E"), but if I copy-paste from there to here, it looks like:

>>> DAÃO bisâe

If I do a simple echo + problem characters in the terminal they render properly, so I don't think that's the problem.

The html encoding is UTF-8 (checked with docinfo). I've tried .encode() and .decode() in various places in the code. I also tried the lxml.etree.tostring() with utf-8 declaration (but then .iter() doesn't work ('bytes' object has no attribute 'iter'), or if I put it towards the endnodes in the code, the .text doesn't work ('bytes' object has no attribute 'text')).

Any ideas what's going wrong and/or how to solve?

Having coded a couple of basic websites myself, it seems strange that these characters, ñ – etc, are in the source html in the first place. I would expect them not to render correctly in the browser, but they do. Maybe this is related to the problem.
– Bob, Commented Dec 11, 2019 at 9:36

Tomalak · Accepted Answer · 2019-12-11 17:32:11Z

1

Open the file with the correct encoding (I'm assuming UTF-8 here, look at the HTML file to confirm).

import lxml.html as LH

with open("path/to/file", encoding="utf8") as f:
    tree = LH.parse(f)
    root = tree.getroot()
    for para in root.iter("p"):
        print(para.text)

Background explanation of how you arrived where you currently are.

Incoming data from the server:

Bytes (hex)            Decoded as   Result String          Comment
44 41 C3 91 4F         UTF-8        DAÑO                   proper decode
44 41 C3 91 4F         Latin-1      DAÃ▯O                  improper decode

The bytes should not have been decoded as Latin-1, that's an error.

C3 91 represents one character in UTF-8 (the Ñ) but it's two characters in Latin-1 (the Ã, and byte 91). But byte 91 is a non-printing control code in Latin-1, so there is no glyph to display. I used ▯ to make it visible. A text editor might skip it altogether, showing DAÃO instead, or a weird box, or an error marker.
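If you want to see the two decodes from the table for yourself, here is a minimal sketch in plain Python (no lxml involved). The variable names are only for illustration.

```python
# The five bytes from the table: "DAÑO" encoded as UTF-8.
raw = bytes.fromhex("44 41 C3 91 4F")

good = raw.decode("utf-8")    # proper decode: C3 91 becomes one character (Ñ)
bad = raw.decode("latin-1")   # improper decode: C3 and 91 become two characters

print(good, len(good))        # 4 characters
print(repr(bad), len(bad))    # 5 characters; the 4th one is the control code U+0091
```

Python's latin-1 codec never fails, because every byte value maps to some code point. That is why this kind of mistake goes unnoticed until the text is displayed.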

When writing the improperly decoded string to file:

String                 Encoded as   Result Bytes (hex)     Comment
DAÃ▯O                  UTF-8        44 41 C3 83 C2 91 4F   weird box preserved as C2 91

The string should not have been encoded as UTF-8 at this point, that's an error, too.

The Ã got converted to C3 83, which is correct for this character in UTF-8. Note how the byte sequence now matches what you told me in the comments (\xc3\x83\xc2\x91).
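The second step can be reproduced the same way. This is a sketch that assumes the wrong decode really was Latin-1, as described above:

```python
# Step 1 (the first mistake): UTF-8 bytes decoded as Latin-1.
bad = bytes.fromhex("44 41 C3 91 4F").decode("latin-1")

# Step 2 (the second mistake): the mangled string is written out as UTF-8.
written = bad.encode("utf-8")

# Show the result as hex pairs for comparison with the table.
print(" ".join(f"{b:02X}" for b in written))  # 44 41 C3 83 C2 91 4F
```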

When reading that file:

Bytes (hex)            Decoded as   Result String          Comment
44 41 C3 83 C2 91 4F   UTF-8        DAÃ▯O                  unprintable character is retained
44 41 C3 83 C2 91 4F   Latin-1      DAÃ▯Â▯O                unprintable characters are retained

No matter how you decode that, it remains broken.

Your data got mangled by making two mistakes in a row: decoding it improperly, and then re-encoding the result. The right thing would have been to write the bytes from the server directly to disk, without converting them to string at any point.
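A rough sketch of what "write the bytes directly" means. I don't know how the download was done, so `payload` is a stand-in for whatever bytes the server returned, and `save_raw` is a hypothetical helper name:

```python
import tempfile
from pathlib import Path


def save_raw(payload: bytes, path: Path) -> None:
    """Write the bytes exactly as received; no decode or encode step."""
    path.write_bytes(payload)


# Stand-in for the bytes received from the server (assumed to be UTF-8).
payload = "DAÑO – bis’e".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "page.html"
    save_raw(payload, target)
    on_disk = target.read_bytes()                # identical to what was received
    text = target.read_text(encoding="utf-8")    # decode once, when reading

print(text)  # DAÑO – bis’e
```

The point is that the only decode happens at read time, with the encoding the document declares.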

edited Dec 11, 2019 at 17:32

answered Dec 11, 2019 at 9:24


15 Comments

Tomalak Over a year ago

No, that's not strange. HTML supports any character that the chosen file encoding can represent. ñ and &ntilde; are 100% the same thing, but the latter may be easier to enter when you don't have the right keyboard layout for typing an actual ñ, or when the chosen file encoding cannot represent the actual ñ (e.g. to get ñ into an ASCII-encoded file).
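A quick way to convince yourself, using only the standard library's html module. The sample string is made up for illustration:

```python
import html

# Pure-ASCII markup text that uses character references instead of literals.
entity_form = "DA&Ntilde;O &ndash; bis&rsquo;e"

# unescape() resolves the references to the characters they stand for.
decoded = html.unescape(entity_form)

print(entity_form.isascii())  # True (str.isascii needs Python 3.7+)
print(decoded)                # DAÑO – bis’e
```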

Tomalak Over a year ago

It's hard to say what encoding your file is in (it might already be broken depending on how the file was produced in the first place). Only looking at the bytes at the location in question can clarify this.
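One way to look at the bytes is sketched below. `hex_after` is a hypothetical helper, and the two sample values are made up. For a real file you would pass in the result of reading it in binary mode.

```python
def hex_after(data: bytes, marker: bytes, count: int) -> str:
    """Return `count` bytes following the first `marker`, as hex pairs."""
    pos = data.find(marker)
    if pos == -1:
        raise ValueError("marker not found")
    start = pos + len(marker)
    return " ".join(f"{b:02X}" for b in data[start:start + count])


# Made-up samples: a healthy UTF-8 file and a double-encoded one.
healthy = b"<p>DA\xc3\x91O</p>"
broken = b"<p>DA\xc3\x83\xc2\x91O</p>"

print(hex_after(healthy, b"<p>DA", 2))  # C3 91
print(hex_after(broken, b"<p>DA", 4))   # C3 83 C2 91
```

Searching for a plain ASCII marker just before the problem spot avoids having to type the suspect character itself.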

Tomalak Over a year ago

You scraped them? How? And how did you save them? What does the text editor detect as the file encoding? Also, tell me the byte sequence at the offending position.

Tomalak Over a year ago

Yes, the file is already broken. Can you share the URL you have downloaded it from and the commandline you used to save it?

Tomalak Over a year ago

Here is what I think has happened during download. Originally the data was proper UTF-8, that got interpreted as Latin-1, that got written to file as UTF-8, and thus ended up garbled. Reading the file as UTF-8 cannot fix this. Quite possibly the error is on your side, due to mistakes when saving the file. That's why it's important to know how you did that. (It's possible to fix the file while reading it, but it's much more sensible to not break it in the first place.)
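For completeness, a sketch of fixing it while reading. It assumes the damage is exactly the UTF-8, then Latin-1, then UTF-8 round trip described here. `undo_double_encoding` is a hypothetical helper name, not something from lxml.

```python
def undo_double_encoding(text: str) -> str:
    """Reverse a UTF-8 -> Latin-1 -> UTF-8 round trip.

    Raises UnicodeError if the text was not mangled in exactly that way.
    """
    return text.encode("latin-1").decode("utf-8")


# The bytes as they sit in the broken file.
file_bytes = bytes.fromhex("44 41 C3 83 C2 91 4F")

broken = file_bytes.decode("utf-8")   # what you get when reading the file as UTF-8
fixed = undo_double_encoding(broken)  # walk the two mistakes backwards

print(fixed)  # DAÑO
```

If the wrong decode had been Windows-1252 rather than Latin-1, the encode step would need "cp1252" instead, and text that is already correct will usually raise an error, so treat this as a one-off repair rather than something to leave in the pipeline.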


Ferran · Answer · 2019-12-11 09:18:30Z

0

I've found the unidecode package to work quite well converting non-ascii characters to the closest ascii.

from unidecode import unidecode
def check_ascii(in_string):
    if in_string.isascii():  # Available in python 3.7+
        return in_string
    else:
        return unidecode(in_string)  # Converts non-ascii characters to the closest ascii

Then if you believe some text might contain non-ascii characters you can pass it to the above function. In your case after extracting the text between the html tags you can pass it with:

for para in root.iter("p"):
    print(check_ascii(para.text))

You can find details about the package here: https://pypi.org/project/Unidecode/

edited Dec 11, 2019 at 9:18

answered Dec 11, 2019 at 9:12


3 Comments

Bob Over a year ago

Thanks for answering. This works only in the sense that it gets rid of those ugly boxes, but the character rendering is still wrong. Ñ becomes A, ' becomes a, ñ becomes A+-, ndash disappears completely.

Tomalak Over a year ago

You don't want your file to be converted to ASCII.

Ferran Over a year ago

I see, sorry about that, it might be a problem of the original source then. @Tomalak the idea was not to convert the file to ASCII but the extracted text
