
I am writing some HTML parsers using lxml's XPath support. It mostly works, but I have one main problem.

The HTML <p> tags I parse contain words wrapped in inline tags such as <b> and <i>. I need to keep those tags.

For example, given this HTML:

<div class="ArticleDetail">
    <p>Hello world, this is a <b>simple</b> test, which contains words in <i>italic</i> and others.
    I have a <strong>strong</strong> tag here. I guess this is a silly test.
    <br/>
    Ops, line breaks.
    <br/></p>
</div>

If I run this Python code:

import lxml.html

x = lxml.html.fromstring("...html text...").xpath("//div[@class='ArticleDetail']/p")
for stuff in x:
    print(stuff.text_content())

This works, but it strips every tag inside the <p>, not just the <p> itself.

Output:

Hello world, this is a simple test, which contains words in italic and others.
I have a strong tag here. I guess this is a silly test.
Ops, line breaks.

As you can see, it removed all the <b>, <i> and <strong> tags. Is there any way to keep them?

  • Thanks for editing, forgot to add those tags to code sample. Commented Sep 5, 2012 at 13:27

1 Answer


You are currently retrieving only the text content, not the HTML content (which would include tags).

You want to retrieve all child nodes of your XPath match instead:

import lxml.html
from lxml import etree

x = lxml.html.fromstring("...html text...").xpath("//div[@class='ArticleDetail']/p")
for elem in x:
    # Serialize each descendant element, tags included
    for child in elem.iterdescendants():
        print(etree.tostring(child))
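
If what you need is one string per <p> with the inline tags kept, a variation is to join the paragraph's leading text with the serialized children. This is only a rough sketch, assuming Python 3 and a reasonably recent lxml; `inner_html` is a hypothetical helper name (not part of lxml), and the sample HTML is a shortened stand-in for the markup in the question.

```python
import lxml.html
from lxml import etree

# Shortened stand-in for the markup in the question (an assumption).
html = """<div class="ArticleDetail">
    <p>Hello world, this is a <b>simple</b> test with <i>italic</i> words.<br/>Next line.</p>
</div>"""


def inner_html(elem):
    """Hypothetical helper: markup inside elem, without elem's own tags."""
    # elem.text is the text before the first child element (None if absent).
    parts = [elem.text or ""]
    for child in elem:
        # tostring() includes the child's tail text by default, so the text
        # between and after the inline tags is preserved.
        parts.append(etree.tostring(child, encoding="unicode", method="html"))
    return "".join(parts)


paragraphs = lxml.html.fromstring(html).xpath("//div[@class='ArticleDetail']/p")
result = [inner_html(p) for p in paragraphs]
for text in result:
    print(text)
```

Only direct children are serialized here, because each serialized child already carries its own descendants; iterating over all descendants would repeat nested elements.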

2 Comments

Interesting. I ran the code and it works fine, except that I see </br> being replaced by a \n. Is this normal? If so, why, and is there a way I can catch the </br>? Thanks.
@BenMezger: Your example has invalid </br> tags (the slash is in the wrong place), so the parser drops them. The newlines were already in your input; they are not a result of the faulty tags.
