I'm trying to parse some HTML and I'm having a problem with this small snippet.
HTML:
<div>
<p><span><a href="../url"></a></span></p>
<h3 class="header"><a href="../url">Other</a></h3>
<a href="../url">Other</a><br>
<a class="aaaaa" href="../url">Indice</a>
<p></p>
</div>
Code:
import urllib
from lxml import etree
import StringIO
resultado=urllib.urlopen('trozo.html')
html = resultado.read()
parser= etree.HTMLParser()
tree=etree.parse(StringIO.StringIO(html),parser)
xpath='/div/h3'
html_filtrado=tree.xpath(xpath)
print html_filtrado
When I print the result I get [], but I expected a list containing <h3 class="header"><a href="../url">Other</a></h3>.
If I had that list, I would call etree.tostring(html_filtrado) to see <h3 class="header"><a href="../url">Other</a></h3>.
So how can I get this code?
<h3 class="header"><a href="../url">Other</a></h3>
Or only ../url, which is the part I actually want?
Thank you
<br> without a closing tag is illegal XML, and lxml is first and foremost an XML parsing library. To handle broken HTML you need to set some flags on the parser. Try using an HTML parser instead, or convert your HTML to XHTML.