I'm using ElementTree to parse an XML document and extract the text from the u elements. Some of them have mixed content that I need to either filter out or keep as text. Two examples:
<u>
<vocal type="filler">
<desc>eh</desc>
</vocal>¿Sí?
</u>
<u>Pues...
<vocal type="non-ling">
<desc>laugh</desc>
</vocal>A mí no me suena.
</u>
I want to get the text within the vocal tag if its type is filler, but not if its type is non-ling.
If I iterate through the children of u, the last bit of text (the part after the closing vocal tag) is always lost. The only way I can reach it is by using itertext(), but then I lose the chance to check the type of the vocal tag.
How can I parse it so that I get a result like this:
eh ¿Sí?
Pues... A mí no me suena.
itertext()) and filter out what you don't need. Alternatively you could preprocess the XML with a simple XSLT transform to remove the subtrees you don't want.
for t in u_element.iter(): print(t.text) -> This only prints "eh" in the first and "Pues...laugh" in the second u element. Not sure I know how to do what you suggest.
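For what it's worth, the "lost" text is probably not lost at all: ElementTree stores any text that follows an element's closing tag in that element's tail attribute, not in the parent's text. So "¿Sí?" and "A mí no me suena." would be the tail of each vocal element. Below is a minimal sketch of one way to combine text, tail and the type check. The wrapping root element, the helper name utterance_text and the keep_types parameter are assumptions made for illustration; they are not from the original document.

```python
import xml.etree.ElementTree as ET

# The two examples from the question, wrapped in a hypothetical <root>
# element so that they form a single well-formed document.
XML = """<root>
<u>
<vocal type="filler">
<desc>eh</desc>
</vocal>¿Sí?
</u>
<u>Pues...
<vocal type="non-ling">
<desc>laugh</desc>
</vocal>A mí no me suena.
</u>
</root>"""


def utterance_text(u, keep_types=("filler",)):
    """Return the text of one <u> element (hypothetical helper).

    The content of a <vocal> child is kept only if its type is listed
    in keep_types; the tail text after every child is always kept.
    """
    parts = [u.text or ""]  # text before the first child
    for child in u:
        if child.tag != "vocal" or child.get("type") in keep_types:
            # Keep everything inside this child, e.g. the <desc> text.
            parts.append("".join(child.itertext()))
        # Text after the child's closing tag belongs to the child.
        parts.append(child.tail or "")
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join(" ".join(parts).split())


root = ET.fromstring(XML)
lines = [utterance_text(u) for u in root.iter("u")]
for line in lines:
    print(line)
```

With the two sample elements this should print "eh ¿Sí?" and "Pues... A mí no me suena.". The sketch only looks at direct children of u, so more deeply nested mixed content would need a recursive variant.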