I am writing some HTML parsers using lxml's XPath support. It mostly works, but I have one main problem.
Some of the words inside the HTML <p> tags are wrapped in inline tags such as <b>, <i>, and so on, and I need to keep those tags.
For example, given this HTML:
<div class="ArticleDetail">
<p>Hello world, this is a <b>simple</b> test, which contains words in <i>italic</i> and others.
I have a <strong>strong</strong> tag here. I guess this is a silly test.
<br/>
Ops, line breaks.
<br/></p>
</div>
If I run this Python code:
import lxml.html

x = lxml.html.fromstring("...html text...").xpath("//div[@class='ArticleDetail']/p")
for stuff in x:
    print(stuff.text_content())
This runs without errors, but it strips every tag inside the <p>, not just the <p> tag itself.
Output:
Hello world, this is a simple test, which contains words in italic and others.
I have a strong tag here. I guess this is a silly test.
Ops, line breaks.
As you can see, it removed all the <b>, <i> and <strong> tags. Is there any way to keep them?