Python - Keeping some HTML tags with lxml Xpath feature

Question

I am writing some HTML parsers using lxml's XPath support. It seems to be working fine, but I have one main problem.

When parsing the HTML, some words are wrapped in tags such as <b>, <i>, and <strong>. I need to keep those tags.

For example, given this HTML:

<div class="ArticleDetail">
    <p>Hello world, this is a <b>simple</b> test, which contains words in <i>italic</i> and others.
    I have a <strong>strong</strong> tag here. I guess this is a silly test.
    <br/>
    Ops, line breaks.
    <br/></p>
</div>

If I run this Python code:

import lxml.html

x = lxml.html.fromstring("...html text...").xpath("//div[@class='ArticleDetail']/p")
for stuff in x:
    print(stuff.text_content())

This works, but it strips every tag inside the <p>, not only the <p> tag itself.

Output:

Hello world, this is a simple test, which contains words in italic and others.
I have a strong tag here. I guess this is a silly test.
Ops, line breaks.

As you can see, it removed all the <b>, <i> and <strong> tags. Is there any way to keep them?

Thanks for editing, I forgot to add those tags to the code sample.
– user689383, Commented Sep 5, 2012 at 13:27

Martijn Pieters · Accepted Answer · 2012-09-05 13:28:23Z

You are currently retrieving only the text content, not the HTML content (which would include tags).

You want to retrieve all child nodes of your XPath match instead:

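import lxml.html
from lxml import etree

x = lxml.html.fromstring("...html text...").xpath("//div[@class='ArticleDetail']/p")
for elem in x:
    # Text before the first child tag is stored on elem.text, not on a child
    print(elem.text or "")
    # Serialize each direct child; tostring() also emits the child's trailing (tail) text
    for child in elem.iterchildren():
        print(etree.tostring(child, encoding="unicode"))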
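
If the goal is one string holding everything inside the <p> (leading text plus the nested tags), the pieces can be joined. This is only a minimal sketch: `inner_html` is a hypothetical helper name, not part of lxml, and the sample markup is adapted from the question with a closing </div> added.

```python
import lxml.html
from lxml import etree

# Sample markup adapted from the question (closing </div> added here).
HTML = """
<div class="ArticleDetail">
    <p>Hello world, this is a <b>simple</b> test, which contains words in <i>italic</i> and others.
    I have a <strong>strong</strong> tag here.
    <br/>
    Ops, line breaks.
    <br/></p>
</div>
"""


def inner_html(elem):
    """Return the markup inside elem without elem's own tag (hypothetical helper)."""
    # Text before the first child lives on elem.text
    parts = [elem.text or ""]
    # Each child is serialized together with the text that follows it (its tail)
    for child in elem.iterchildren():
        parts.append(etree.tostring(child, encoding="unicode"))
    return "".join(parts)


doc = lxml.html.fromstring(HTML)
for p in doc.xpath("//div[@class='ArticleDetail']/p"):
    print(inner_html(p))
```

Only direct children are serialized here, so a tag nested inside another tag is emitted once, as part of its parent.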

2 Comments

user689383 Over a year ago

Interesting. I ran the code and it works fine, except that I see <\br> being replaced by a \n. Is this normal? If so, why, and is there a way I can catch the <\br>? Thanks.

Martijn Pieters Over a year ago

@BenMezger: Your example has invalid <\br> tags (the slash is in the wrong place). They are thus dropped by the parser; the newlines were already there and are not the result of the faulty tags.
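
To see where the newlines and the line breaks end up with well-formed tags, something like the following may help. It is a rough sketch under assumptions: the `snippet` string is made up for illustration, and it only covers valid <br/> tags, not how the parser recovers from malformed ones.

```python
import lxml.html

# Assumed sample: well-formed <br/> tags with newlines in the surrounding text.
snippet = "<p>first line\n<br/>\nsecond line\n<br/></p>"
p = lxml.html.fromstring(snippet)

# Each valid <br/> becomes its own element under <p>
breaks = p.findall("br")
print(len(breaks))

# The text that follows a <br/> (newlines included) is stored in its .tail
print(repr(breaks[0].tail))

# Serializing the paragraph keeps the line breaks as tags
serialized = lxml.html.tostring(p, encoding="unicode")
print(serialized)
```

The newlines belong to the text around the tags, so they show up whether or not the <br/> elements survive parsing.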