I'm using ElementTree to parse an XML document and extract the text from its u tags. Some of them have mixed content, parts of which I need to filter out and parts of which I need to keep as text. Two examples are:

<u>
   <vocal type="filler">
     <desc>eh</desc>
   </vocal>¿Sí? 
</u>

<u>Pues... 
   <vocal type="non-ling">
     <desc>laugh</desc>
   </vocal>A mí no me suena. 
</u>

I want to get the text within the vocal tag if its type is filler, but not if its type is non-ling.

If I iterate through the children of u, somehow the last text bit is always lost. The only way that I can reach it is by using itertext(). But then the chance to check the type of the vocal tag is lost.

How can I parse it so that I get a result like this:

eh ¿Sí? 
Pues... A mí no me suena. 
  • You'll have to iterate over all child nodes and text nodes "manually" (i.e. without using itertext()) and filter out what you don't need. Alternatively you could preprocess the XML with a simple XSLT transform to remove the subtrees you don't want. Commented Nov 9, 2017 at 17:38
  • When I iterate manually the last text bit is always lost. Commented Nov 9, 2017 at 17:55
  • Show the code you're using to iterate manually. It's probably recursive, are you remembering to continue iterating after the recursive call to capture following text nodes? Commented Nov 9, 2017 at 17:58
  • for t in u_element.iter(): print(t.text) -> This only prints "eh" in the first and "Pues...laugh" in the second u element. Not sure I know how to do what you suggest. Commented Nov 13, 2017 at 17:00

1 Answer

The lost text bits, "¿Sí?" and "A mí no me suena.", are available as the tail property of each <vocal> element (the text following the element's end tag).
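As a quick illustration of where each piece of text lives, here is a minimal sketch (written for Python 3; the inline string is just the second example from the question collapsed onto one line, not a file from the original post):

```python
import xml.etree.ElementTree as ET

# Inline sample mirroring the second <u> element from the question.
u = ET.fromstring(
    '<u>Pues... <vocal type="non-ling"><desc>laugh</desc></vocal>'
    'A mí no me suena. </u>'
)
vocal = u.find("vocal")

print(repr(u.text))      # text before the first child of <u>
print(repr(vocal.text))  # text inside <vocal> before <desc>; there is none here
print(repr(vocal.tail))  # text after </vocal>, up to the next tag
```

This is why iterating with iter() and printing only .text never shows the trailing bit: it is stored on the child element's tail, not on the parent's text.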

Here is a way to get the wanted output (tested with Python 2.7).

Assume that vocal.xml looks like this:

<root>
  <u>
    <vocal type="filler">
      <desc>eh</desc>
    </vocal>¿Sí? 
  </u>

  <u>Pues... 
     <vocal type="non-ling">
       <desc>laugh</desc>
     </vocal>A mí no me suena. 
  </u>
</root>

Code:

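from xml.etree import ElementTree as ET

# ET.parse() returns an ElementTree; findall() searches from its root element.
tree = ET.parse("vocal.xml")

for u in tree.findall(".//u"):
    v = u.find("vocal")  # assumes each <u> has exactly one <vocal> child

    if v.get("type") == "filler":
        # Keep the filler's description between the leading text and the tail.
        frags = [u.text, v.findtext("desc"), v.tail]
    else:
        # Skip the description of non-linguistic sounds such as "laugh".
        frags = [u.text, v.tail]

    # Python 2: encode to UTF-8 bytes, trim whitespace, skip missing (None) parts.
    print " ".join(t.encode("utf-8").strip() for t in frags if t).strip()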

Output:

eh ¿Sí?
Pues... A mí no me suena.
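
If you are on Python 3, or a u element can contain several vocal children (or none), the same idea can be generalized. The following is only a sketch under those assumptions; utterance_text and SAMPLE are names made up for this example, and the XML is inlined so the snippet runs without vocal.xml:

```python
import xml.etree.ElementTree as ET

def utterance_text(u):
    """Join the text of a <u> element, keeping <desc> only for filler vocals."""
    frags = [u.text]
    for child in u:
        # Keep the description only when the vocal is a filler.
        if child.tag == "vocal" and child.get("type") == "filler":
            frags.append(child.findtext("desc"))
        # The tail is part of the utterance's running text, so always keep it.
        frags.append(child.tail)
    # Drop None and whitespace-only fragments, then trim and join with spaces.
    return " ".join(f.strip() for f in frags if f and f.strip())

SAMPLE = """<root>
  <u>
    <vocal type="filler">
      <desc>eh</desc>
    </vocal>¿Sí?
  </u>
  <u>Pues...
     <vocal type="non-ling">
       <desc>laugh</desc>
     </vocal>A mí no me suena.
  </u>
</root>"""

root = ET.fromstring(SAMPLE)
for u in root.iter("u"):
    print(utterance_text(u))
```

For the sample above this prints the same two lines as the output shown earlier. Child elements other than vocal contribute only their tail here; adjust that branch if you need their inner text as well.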