1

I'm trying to parse some HTML and I'm having a problem with this small snippet.

XML:

<div>
    <p><span><a href="../url"></a></span></p>
    <h3 class="header"><a href="../url">Other</a></h3>
    <a href="../url">Other</a><br>
    <a class="aaaaa" href="../url">Indice</a>
    <p></p>               
</div>

code:

import urllib
import StringIO
from lxml import etree

resultado = urllib.urlopen('trozo.html')
html = resultado.read()
parser = etree.HTMLParser()
tree = etree.parse(StringIO.StringIO(html), parser)
xpath = '/div/h3'
html_filtrado = tree.xpath(xpath)
print html_filtrado

When I print the result I get [], but I expected a list containing <h3 class="header"><a href="../url">Other</a></h3>. If I had that list, I would call etree.tostring(html_filtrado[0]) to see <h3 class="header"><a href="../url">Other</a></h3>.

So how can I get this markup?

<h3 class="header"><a href="../url">Other</a></h3>

Or only ../url, which is the part I actually want?

Thank you

2
  • What you posted is not valid XML: <br> without a closing tag is illegal in XML. lxml is first and foremost an XML parsing library; to accept broken HTML you need to set some flags on the parser. Try using an HTML parser instead, or convert your HTML to XHTML. Commented Oct 26, 2011 at 22:50
  • But I have parsed a lot of pages without problems, even with <br>! So which flags do I need to use? I really like this parser because it's really fast! Commented Oct 26, 2011 at 22:54

2 Answers

4

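The XPath query in your example is not quite right: '/div/h3' is an absolute path that only matches when div is the root element of the tree.

To get a list of all h3 elements that are direct children of a div, anywhere in the document, use this: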

elements = tree.xpath('//div/h3')
etree.tostring(elements[0])

Which should give:

'<h3 class="header"><a href="../url">Other</a></h3>\n'

To get a list of all href attributes of a tags within h3 tags, you could use something like this:

tree.xpath('//h3/a/@href')

Which gives:

['../url']

1 Comment

Thank you so much!!!!! That works for me!! I think that I have to learn more about xpath. Thank you
3

The issue is that when etree.HTMLParser() receives an HTML fragment, it builds a complete HTML document tree around it. So instead of the bare fragment you expected, etree.tostring(tree) gives you:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
<p><span><a href="../url"/></span></p>
<h3 class="header"><a href="../url">Other</a></h3>
<a href="../url">Other</a><br/><a class="aaaaa" href="../url">Indice</a>
<p/>
</div></body></html>

So the correct absolute XPath would be '/html/body/div/h3'.
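
A small sketch of this wrapping behaviour, assuming Python 3 and a shortened inline fragment in place of trozo.html (both are assumptions for illustration, not the original setup):

```python
import io

from lxml import etree

# Shortened stand-in for the fragment in the question (assumption).
fragment = b'<div><h3 class="header"><a href="../url">Other</a></h3></div>'

# Same parsing approach as in the question: parse a file-like object
# with an explicit HTMLParser.
tree = etree.parse(io.BytesIO(fragment), etree.HTMLParser())

# The parser has added html and body around the div, so the root
# element is html rather than div.
root_tag = tree.getroot().tag
print(root_tag)

# The path from the question starts at the root and expects a div there.
original_result = tree.xpath('/div/h3')
print(original_result)

# The absolute path has to include the elements the parser added.
absolute_result = tree.xpath('/html/body/div/h3')
print([el.tag for el in absolute_result])
```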

2 Comments

It doesn't work! :( This is part of a big document and the XPath is '/html/body/......./div/h3'. The problem is with h3: I can read everything up to the div, but then it doesn't recognize the h3 tag.
Could you please post the whole document? The fragment is not enough to find out what is wrong with your structure. You can of course stick to the other answer on this question, but it's not optimal :) Alternatively, it may be that HTMLParser() repairs broken HTML by default, so if the document was broken and then fixed, it could contain extra HTML tags. Try etree.tostring() and look at the structure again.
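
One way to look at the repaired structure without reading the whole serialized document is to find the element with a relative query and then ask lxml for its absolute path. The sketch below assumes Python 3 and uses a shortened stand-in fragment, since the full document was not posted; getpath() is a method of lxml's ElementTree objects.

```python
from lxml import etree

# Shortened stand-in for the real document (assumption).
root = etree.HTML(
    '<div><h3 class="header"><a href="../url">Other</a></h3></div>'
)
tree = root.getroottree()

# Find every h3 wherever it ended up, then print the absolute path
# lxml reports for it in the parsed tree.
paths = [tree.getpath(el) for el in root.xpath('//h3')]
for path in paths:
    print(path)

# Each reported path can be fed back into xpath() to reach the element.
found = tree.xpath(paths[0])[0]
```

Comparing the printed path with the hand-written one should show where the two diverge.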
