I'm trying to parse an XML document and return the <input> nodes that have a ref attribute. A toy example works, but the real document returns an empty list when it should contain a match.

toy example

from lxml import etree
tree = etree.XML('<body><input ref="blabla"><label>Cats</label></input><input ref="blabla"><label>Dogs</label></input><input ref="blabla"><label>Birds</label></input></body>')
# I can return the relevant input nodes with:
print(len(tree.findall(".//input[@ref]")))
3

But working with the following (reduced) file for some reason fails:

example.xml

<?xml version="1.0"?>
<h:html xmlns="http://www.w3.org/2002/xforms" xmlns:ev="http://www.w3.org/2001/xml-events" xmlns:h="http://www.w3.org/1999/xhtml" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <h:head>
    <h:title>A title</h:title>
  </h:head>
  <h:body>
    <group ref="blabla">
      <label>Group 1</label>
      <input ref="blabla">
        <label>Field 1</label>
      </input>
    </group>
  </h:body>
</h:html>

script

from lxml import etree
with open("example.xml", "r") as myfile:
    xml = myfile.read()
tree = etree.XML(xml)
print(len(tree.findall(".//input[@ref]")))
0

Any idea why this fails, and how to work around it? I think it may have something to do with the XML header. Very grateful for any assistance.

  • What's the error message? What exactly fails? Commented Aug 26, 2015 at 0:23

2 Answers

I think the problem is that every element in your document is in a namespace, so the un-namespaced expression .findall(".//input[@ref]") doesn't match the input element in the document. That element is actually a namespaced input element, in the http://www.w3.org/2002/xforms namespace.

So maybe try this:

.findall(".//{http://www.w3.org/2002/xforms}input[@ref]")

Updated after my original answer, to use the xforms namespace instead of the xhtml namespace (as had been noted in another answer).
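
If the full namespace URI inside the path is hard to read, a prefix map is another option. The following is a minimal sketch, not a tested drop-in for the asker's script: it uses the standard library's xml.etree.ElementTree rather than lxml, a shortened copy of the document from the question, and the prefix "xf", which is an arbitrary name chosen here for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of example.xml from the question.
DOC = """<?xml version="1.0"?>
<h:html xmlns="http://www.w3.org/2002/xforms" xmlns:h="http://www.w3.org/1999/xhtml">
  <h:body>
    <group ref="blabla">
      <label>Group 1</label>
      <input ref="blabla">
        <label>Field 1</label>
      </input>
    </group>
  </h:body>
</h:html>"""

# "xf" is an assumed prefix for this sketch; it does not have to match
# anything in the document, only the URI matters.
NS = {"xf": "http://www.w3.org/2002/xforms"}

root = ET.fromstring(DOC)

# The prefix is expanded to the namespace URI before matching.
matches = root.findall(".//xf:input[@ref]", NS)
print(len(matches))
```

lxml's findall accepts a namespaces mapping in the same way, so the same path should carry over.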

3 Comments

Hi sideshowbarker. Sorry still an empty array for me.
OK, I didn’t actually test it, but will do that right now and see what I get
Ha ha! .findall(".//{http://www.w3.org/2002/xforms}input[@ref]") is the ticket :)

As your XML shows, the namespace for non-prefixed elements is "http://www.w3.org/2002/xforms". That is because it is declared as the default namespace (xmlns without any prefix) on the root element h:html. Only elements prefixed with h: are in the "http://www.w3.org/1999/xhtml" namespace.
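
One way to check which namespace each element ended up in is to print the parsed tags. This is a small sketch using the standard library's xml.etree.ElementTree on a shortened copy of the document from the question; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of example.xml from the question.
DOC = """<?xml version="1.0"?>
<h:html xmlns="http://www.w3.org/2002/xforms" xmlns:h="http://www.w3.org/1999/xhtml">
  <h:body>
    <group ref="blabla">
      <input ref="blabla">
        <label>Field 1</label>
      </input>
    </group>
  </h:body>
</h:html>"""

root = ET.fromstring(DOC)

# ElementTree stores each tag as "{namespace-uri}localname", so the
# namespace of every element is visible directly in el.tag.
tags = [el.tag for el in root.iter()]
for tag in tags:
    print(tag)
```

The h: elements come out with the xhtml URI, while group, input and label carry the xforms URI, and no element has the bare tag "input".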

So you need to use that namespace in your query as well. For example:

root.findall(".//{http://www.w3.org/2002/xforms}input[@ref]")

Demo:

>>> s = """<?xml version="1.0"?>
... <h:html xmlns="http://www.w3.org/2002/xforms" xmlns:ev="http://www.w3.org/2001/xml-events" xmlns:h="http://www.w3.org/1999/xhtml" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
...   <h:head>
...     <h:title>A title</h:title>
...   </h:head>
...   <h:body>
...     <group ref="blabla">
...       <label>Group 1</label>
...       <input ref="blabla">
...         <label>Field 1</label>
...       </input>
...     </group>
...   </h:body>
... </h:html>"""
>>> import xml.etree.ElementTree as ET
>>> root = ET.fromstring(s)
>>> root.findall(".//{http://www.w3.org/1999/xhtml}input[@ref]")
[]
>>> root.findall(".//{http://www.w3.org/2002/xforms}input[@ref]")
[<Element '{http://www.w3.org/2002/xforms}input' at 0x02288EA0>]
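
If the namespace is not known in advance, matching on the local name is a possible fallback. The sketch below is an assumption-laden illustration rather than part of the answer above: local_name is a hypothetical helper written for this example, and the document is a shortened copy of the one in the question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of example.xml from the question.
DOC = """<?xml version="1.0"?>
<h:html xmlns="http://www.w3.org/2002/xforms" xmlns:h="http://www.w3.org/1999/xhtml">
  <h:body>
    <group ref="blabla">
      <input ref="blabla">
        <label>Field 1</label>
      </input>
    </group>
  </h:body>
</h:html>"""


def local_name(tag):
    """Hypothetical helper: strip a leading "{namespace-uri}" from a tag."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(DOC)

# Keep every element whose local name is "input" and that has a ref
# attribute, regardless of which namespace it is in.
inputs = [
    el
    for el in root.iter()
    if isinstance(el.tag, str)
    and local_name(el.tag) == "input"
    and "ref" in el.attrib
]
print(len(inputs))
```

This trades precision for convenience: it would also match an input element from an unrelated namespace, so the explicit xforms query is the safer choice when the namespace is known.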
