Get text from mixed element xml tags with ElementTree

Question

I'm using ElementTree to parse an XML document. I am extracting the text from the u tags, some of which have mixed content that I need to either filter out or keep as text. Two examples:

<u>
   <vocal type="filler">
     <desc>eh</desc>
   </vocal>¿Sí? 
</u>

<u>Pues... 
   <vocal type="non-ling">
     <desc>laugh</desc>
   </vocal>A mí no me suena. 
</u>

I want to get the text within the vocal tag if its type is filler, but not if its type is non-ling.

If I iterate through the children of u, the last bit of text is always lost. The only way I can reach it is with itertext(), but then I lose the chance to check the type of the vocal tag.

How can I parse it so that I get a result like this:

eh ¿Sí? 
Pues... A mí no me suena.

You'll have to iterate over all child nodes and text nodes "manually" (i.e. without using itertext()) and filter out what you don't need. Alternatively, you could preprocess the XML with a simple XSLT transform to remove the subtrees you don't want.
– Jim Garrison, Commented Nov 9, 2017 at 17:38
Show the code you're using to iterate manually. It's probably recursive; are you remembering to continue iterating after the recursive call to capture the following text nodes?
– Jim Garrison, Commented Nov 9, 2017 at 17:58
for t in u_element.iter(): print(t.text) only prints "eh" for the first u element and "Pues..." and "laugh" for the second. I'm not sure how to do what you suggest.
– alpoktem, Commented Nov 13, 2017 at 17:00

mzjn · Accepted Answer · 2019-04-07 07:23:47Z

4

The lost text bits, "¿Sí?" and "A mí no me suena.", are available as the tail property of each <vocal> element (the text following the element's end tag).
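To make the text/tail split concrete, here is a minimal sketch. It assumes Python 3 and parses an inline string instead of a file; the sample is the second <u> element from the question with its whitespace collapsed onto one line so the values are easier to read.

```python
from xml.etree import ElementTree as ET

# Inline copy of the second <u> element from the question (assumption:
# whitespace simplified for readability; the real file has line breaks).
u = ET.fromstring(
    '<u>Pues... <vocal type="non-ling"><desc>laugh</desc></vocal>'
    'A mí no me suena. </u>'
)
vocal = u.find("vocal")

print(repr(u.text))      # text inside <u> before the first child element
print(repr(vocal.text))  # text inside <vocal> before <desc> (none here)
print(repr(vocal.tail))  # text after </vocal>, still inside <u>
```

Iterating with iter() and reading only .text never visits the tail, which is why the trailing fragment appeared to be lost.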

Here is a way to get the wanted output (tested with Python 2.7).

Assume that vocal.xml looks like this:

<root>
  <u>
    <vocal type="filler">
      <desc>eh</desc>
    </vocal>¿Sí? 
  </u>

  <u>Pues... 
     <vocal type="non-ling">
       <desc>laugh</desc>
     </vocal>A mí no me suena. 
  </u>
</root>

Code:

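from xml.etree import ElementTree as ET

# ET.parse() returns an ElementTree; findall() can be called on it directly.
tree = ET.parse("vocal.xml")

for u in tree.findall(".//u"):
    v = u.find("vocal")

    if v.get("type") == "filler":
        # Keep the filler's description between the leading text and the tail.
        frags = [u.text, v.findtext("desc"), v.tail]
    else:
        # Skip the element's content but keep the text that follows it.
        frags = [u.text, v.tail]

    # text/tail may be None; encode to UTF-8 for printing under Python 2.
    print " ".join((t or u"").encode("utf-8").strip() for t in frags).strip()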

Output:

eh ¿Sí?
Pues... A mí no me suena.
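The code above assumes exactly one <vocal> child per <u>. As a hedged sketch of how the same idea might generalize, the Python 3 version below walks every child of <u>, so it also copes with zero or several <vocal> elements. The helper name utterance_text is hypothetical, and the XML is inlined as a string rather than read from vocal.xml so the example is self-contained.

```python
from xml.etree import ElementTree as ET


def utterance_text(u):
    """Return the text of a <u> element, keeping filler descriptions only."""
    frags = [u.text]
    for child in u:
        if child.tag == "vocal" and child.get("type") == "filler":
            frags.append(child.findtext("desc"))
        # The tail is part of the parent's text flow, so always keep it.
        frags.append(child.tail)
    # Drop None and whitespace-only pieces, then join with single spaces.
    return " ".join(f.strip() for f in frags if f and f.strip())


sample = """<root>
  <u>
    <vocal type="filler">
      <desc>eh</desc>
    </vocal>¿Sí?
  </u>
  <u>Pues...
    <vocal type="non-ling">
      <desc>laugh</desc>
    </vocal>A mí no me suena.
  </u>
</root>"""

root = ET.fromstring(sample)
lines = [utterance_text(u) for u in root.findall(".//u")]
for line in lines:
    print(line)
```

In this sketch the content of any child that is not a filler <vocal> is dropped, while its tail is kept; whether that is the right policy for other child elements depends on the corpus.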

edited Apr 7, 2019 at 7:23

answered Nov 15, 2017 at 6:28

mzjn
