Correct XPath syntax with Python's lxml library for parsing all the text from arbitrarily nested HTML tags

Question

Using lxml in Python, I wrote this XPath expression:

htmlPage.xpath("/html/body//a/text()")

It returns the text of all <a> tags in the parts of the HTML I am interested in. I have now found that the <a> tags can look like this:

<a>This is a sentence with some <italic>italic text</italic>-formatting I want to parse.</a>

xpath returns a list with one more element than I expect. I checked and found that it splits the <a> tag shown above into two list elements instead of one. Instead of the string

"This is a sentence with some italic text-formatting I want to parse."

I get the two strings

"This is a sentence with some" # and
"-formatting I want to parse."

Is there a way to correct that?
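
For reference, the behaviour can be reproduced with a small self-contained snippet. This is only a sketch: the inline HTML string and the names `doc` and `parts` are made up for illustration, and `lxml.html.fromstring` is assumed as the way the page was parsed. The path `a/text()` selects only text nodes that are direct children of `<a>`, so the text inside `<italic>` is skipped and the text around it comes back as two separate nodes.

```python
from lxml import html

# Hypothetical stand-in for the real page, containing only the problematic <a> tag.
doc = html.fromstring(
    "<html><body><a>This is a sentence with some "
    "<italic>italic text</italic>-formatting I want to parse.</a></body></html>"
)

# text() matches only the text nodes directly under <a>,
# not the text nested inside <italic>.
parts = doc.xpath("/html/body//a/text()")

for part in parts:
    print(repr(part))
```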

Aufwind · Accepted Answer · 2011-05-30 12:14:24Z


I solved my problem by first getting all <a>-tags

results = htmlPage.xpath("/html/body//a")

and then iterating over the returned list and calling text_content() on each element

for a_tag in results:
    # text_content() joins the element's own text with the text of all its descendants
    print(a_tag.text_content())  # prints the whole string: "This is a sentence with some italic text-formatting I want to parse."
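
A runnable version of the same idea, as a minimal sketch. The inline HTML and the names `page`, `full_texts`, `via_xpath` and `via_itertext` are assumptions for illustration, not part of the original code. Besides `text_content()`, it shows two other ways that should give the same string per element: the XPath `string(.)` function evaluated relative to each `<a>`, and joining `itertext()`.

```python
from lxml import html

# Hypothetical stand-in for htmlPage.
page = html.fromstring(
    "<html><body><a>This is a sentence with some "
    "<italic>italic text</italic>-formatting I want to parse.</a></body></html>"
)

# Select the <a> elements themselves instead of their text nodes.
results = page.xpath("/html/body//a")

# Approach from the answer: text_content() on each element.
full_texts = [a_tag.text_content() for a_tag in results]

# Alternative: XPath string() concatenates all descendant text of the context node.
via_xpath = [a_tag.xpath("string(.)") for a_tag in results]

# Alternative: join every text chunk in document order.
via_itertext = ["".join(a_tag.itertext()) for a_tag in results]

for text in full_texts:
    print(text)
```

Each list holds one string per `<a>` element, so the number of results matches the number of links again.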

edited May 30, 2011 at 12:14

answered May 30, 2011 at 11:16

Aufwind
