1

I am currently using the xml.etree Python library to parse HTML.

After finding a target DOM element, I am attempting to extract its text. Unfortunately, it seems that the .text attribute is severely limited in its functionality and will only return the immediate inner text of an element (and not anything nested). Do I really have to loop through all the children of the ElementTree? Or is there a more elegant solution?
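A minimal sketch of the behaviour described above, using a made-up one-line snippet (not taken from any real page): ElementTree stores the text before the first child in `.text`, and the text following a nested element in that child's `.tail`, so `.text` on the parent never sees it.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, chosen only to illustrate the .text / .tail split.
p = ET.fromstring('<p>Moved to <a href="http://example.org/">example.org</a> today.</p>')
a = p.find("a")

print(repr(p.text))  # text before the first child element
print(repr(a.text))  # text inside the nested <a>
print(repr(a.tail))  # text after </a>, stored on the child, not on <p>
```

This prints `'Moved to '`, `'example.org'` and `' today.'`, which is why reading `.text` alone drops the nested and trailing parts.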

2 Answers

1

You can also use itertext(), which yields the text of an element and all of its descendants. If you don’t want the indentation and line breaks, collapse the whitespace with split() and join().

import xml.etree.ElementTree as ET

html = """<html>
    <head>
        <title>Example page</title>
    </head>
    <body>
        <p>Moved to <a href="http://example.org/">example.org</a>
        or <a href="http://example.com/">example.com</a>.</p>
    </body>
</html>"""

root = ET.fromstring(html)

target_element = root.find(".//body")

# get all text
all_text = ''.join(target_element.itertext())

# collapse indentation and line breaks into single spaces
all_text_clear = ' '.join(all_text.split())

print(all_text)
print(all_text_clear)

Output:

        Moved to example.org
        or example.com.
    
Moved to example.org or example.com.

1

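The XPath descendant axis returns all descendant nodes, including whitespace-only text nodes. Note that this needs a full XPath engine such as lxml; the find() and findall() methods in xml.etree support only a limited XPath subset.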

For example:

//body/descendant::text() or //body/descendant::*/text()

In the general case:

//xpath/to/target/element/descendant::text()
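With xml.etree alone, roughly the same set of text nodes can be collected by walking `.text` and `.tail` by hand. The sketch below is only an approximation of `descendant::text()`, under the assumption that the tree contains plain elements and text; `descendant_text_nodes` is a hypothetical helper name, and the markup is made up for the example.

```python
import xml.etree.ElementTree as ET


def descendant_text_nodes(element):
    """Yield the text fragments under `element` in document order."""
    if element.text:
        yield element.text
    for child in element:
        # Everything inside the child comes first...
        yield from descendant_text_nodes(child)
        # ...then the text that follows the child's closing tag.
        if child.tail:
            yield child.tail


# Hypothetical markup, used only for illustration.
body = ET.fromstring("<body>\n  <p>Moved to <a>example.org</a>.</p>\n</body>")

nodes = list(descendant_text_nodes(body))
non_blank = [n for n in nodes if n.strip()]  # drop whitespace-only fragments

print(nodes)
print(non_blank)
```

For this input the helper yields the same fragments as `body.itertext()`, so in practice `itertext()` is the shorter route; the explicit walk is mainly useful when the fragments need filtering or per-node handling.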
