Parse html with lxml (tag h3)

Question

I'm trying to parse some HTML and I'm having a problem with this small snippet.

HTML:

<div>
    <p><span><a href="../url"></a></span></p>
    <h3 class="header"><a href="../url">Other</a></h3>
    <a href="../url">Other</a><br>
    <a class="aaaaa" href="../url">Indice</a>
    <p></p>               
</div>

code:

import urllib
from lxml import etree
import StringIO
resultado=urllib.urlopen('trozo.html')
html = resultado.read()
parser= etree.HTMLParser()
tree=etree.parse(StringIO.StringIO(html),parser)
xpath='/div/h3'
html_filtrado=tree.xpath(xpath)
print html_filtrado

When I run the code it prints [], but I expected a list containing <h3 class="header"><a href="../url">Other</a></h3>. If I had that list, I would call etree.tostring(html_filtrado) to see <h3 class="header"><a href="../url">Other</a></h3>.

So how can I get this code?

<h3 class="header"><a href="../url">Other</a></h3>

Or only ../url, which is the part I actually want?

Thank you

What you posted is not XML-compliant: <br> without a closing tag is illegal XML. lxml is first and foremost an XML parsing library, so to handle broken HTML you need to set some flags on the parser. Try using an HTML parser instead, or convert your HTML to XHTML.
– user177800, Commented Oct 26, 2011 at 22:50
But I have parsed a lot of pages containing <br> without any problem! So which flags do I need to use? I really like this parser because it's really fast!
– dani, Commented Oct 26, 2011 at 22:54
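
For reference, here is a minimal sketch of the difference the two comments are circling around. It assumes a shortened, inlined version of the snippet from the question (the `SNIPPET` name is made up for this example) and compares lxml's default XML parser with `etree.HTMLParser()`, which the question's code already uses.

```python
from lxml import etree

# Shortened version of the question's markup, with the unclosed <br>.
SNIPPET = '<div><a href="../url">Other</a><br><a class="aaaaa" href="../url">Indice</a></div>'

# Strict XML parsing rejects the unclosed <br>.
try:
    etree.fromstring(SNIPPET)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser treats <br> as an empty element and carries on.
root = etree.fromstring(SNIPPET, etree.HTMLParser())
br_count = len(root.xpath('//br'))
link_count = len(root.xpath('//a'))

print(xml_ok, br_count, link_count)
```

So with `etree.HTMLParser()` the unclosed `<br>` is not the reason for the empty result; no extra flags should be needed for this snippet.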

ekhumoro · Accepted Answer · 2011-10-26 23:31:30Z

4

The XPath query in your example is not quite right.

To get a list of all h3 tags within div tags, you should use this:

elements = tree.xpath('//div/h3')
etree.tostring(elements[0])

Which should give:

'<h3 class="header"><a href="../url">Other</a></h3>\n'

To get a list of all href attributes of a tags within h3 tags, you could use something like this:

tree.xpath('//h3/a/@href')

Which gives:

['../url']


1 Comment

dani Over a year ago

Thank you so much! That works for me! I think I have to learn more about XPath. Thank you

Pavel Shvedov · Answer · 2011-10-26 23:03:12Z

3

The thing is that when etree.HTMLParser() receives an HTML fragment, it builds a full HTML document tree around it. So instead of what you intended, etree.tostring(tree) gives you:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
<p><span><a href="../url"/></span></p>
<h3 class="header"><a href="../url">Other</a></h3>
<a href="../url">Other</a><br/><a class="aaaaa" href="../url">Indice</a>
<p/>

So the correct absolute XPath would be '/html/body/div/h3'.
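
A quick way to see the wrapping is to compare the two absolute paths on a cut-down fragment. This is only a sketch, assuming Python 3; `FRAGMENT` and the variable names are made up for the illustration.

```python
from io import StringIO
from lxml import etree

# Cut-down fragment: a div holding the h3 from the question.
FRAGMENT = '<div><h3 class="header"><a href="../url">Other</a></h3></div>'
tree = etree.parse(StringIO(FRAGMENT), etree.HTMLParser())

# The root is the <html> element the parser added, not the <div>.
root_tag = tree.getroot().tag

# An absolute path starting at /div finds nothing, because div is not the root.
from_root = tree.xpath('/div/h3')

# Spelling out the added wrappers finds the element.
full_path = tree.xpath('/html/body/div/h3')

print(root_tag, len(from_root), len(full_path))
```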


2 Comments

dani Over a year ago

It doesn't work! :( This is part of a big document and the XPath is '/html/body/......./div/h3'. The problem is with h3: I can read everything down to the div, but then the h3 tag isn't recognized.

Pavel Shvedov Over a year ago

Could you please post the whole document? The fragment alone is not enough to find out what's wrong with your structure. Of course, you can stick with the other answer to this question, but it's not optimal :) Or it may be that HTMLParser() fixes broken HTML by default, so if the markup was broken and then repaired, the tree could contain extra HTML tags. Try etree.tostring() and look at the structure again.
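
One way to look at the structure the parser actually built is to dump the tree and ask lxml where each h3 ended up. The sketch below assumes Python 3 and uses the fragment from the question as a stand-in for the full page, which was not posted; `getpath()` is a method of the tree returned by `etree.parse()`.

```python
from io import StringIO
from lxml import etree

# Stand-in for the real page: the fragment from the question.
HTML = """<div>
    <p><span><a href="../url"></a></span></p>
    <h3 class="header"><a href="../url">Other</a></h3>
    <a href="../url">Other</a><br>
    <a class="aaaaa" href="../url">Indice</a>
    <p></p>
</div>"""

tree = etree.parse(StringIO(HTML), etree.HTMLParser())

# Dump the repaired tree to see which elements the parser added or moved.
serialized = etree.tostring(tree, pretty_print=True, method='html', encoding='unicode')
print(serialized)

# Ask lxml for the absolute path of every h3 it found.
paths = [tree.getpath(h3) for h3 in tree.xpath('//h3')]
print(paths)
```

If the path reported for the real document differs from the hand-written '/html/body/......./div/h3', the difference shows where the parser rearranged the markup.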
