I have an XML file generated from a .doc file (MS Word 2003, so I can't use the docx library). I'm using lxml to parse it. I can get most of the text (everything is in <txt> nodes), but some nodes have the following structure:
<txt ptr="0x7f6354043000" id="3" symbol="8SwTxtFrm" next="4" upper="2" txtNodeIndex="9">
<infos>
<bounds left="1521" top="851" width="10517" height="322"/>
</infos>
The text I want to extract <Special nLength="0" nType="POR_MARGIN" rText="" nWidth="2396"/>
<Text nLength="1" nType="POR_TXT" nHeight="322" nWidth="78"/>
<Text nLength="42" nType="POR_TXT" nHeight="322" nWidth="5647"/>
<Special nLength="0" nType="POR_MARGIN" rText="" nWidth="2397"/>
<LineBreak nWidth="10518"/>
<Finish/>
</txt>
When I iterate over the <txt> nodes to extract the text with:
for txt in tree.iter('txt'):
    print(txt.text)
I don't get the text for these nodes. I realized that it's the <infos> node that causes the problem, so I tried to remove it:
for elt in tree.iter('txt'):
    for info in elt.findall('infos'):
        elt.remove(info)
But this removes the targeted text along with the <infos> node, even though the text is outside it.
Can someone help me understand why?
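For reference, here is a minimal sketch that reproduces the behavior on a trimmed-down copy of the node above. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made so the snippet runs without extra installs; lxml.etree follows the same .text/.tail model). In that model, text that follows a closing tag is stored on that element as its .tail, so the text after </infos> belongs to the <infos> element and goes away with it. The helper name remove_keep_tail is hypothetical, not part of either library.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in; lxml.etree uses the same .text/.tail model

# Trimmed-down version of the <txt> node from the question
xml = (
    '<txt id="3">'
    '<infos><bounds left="1521" top="851" width="10517" height="322"/></infos>'
    'The text I want to extract '
    '<Special nLength="0" nType="POR_MARGIN" rText="" nWidth="2396"/>'
    '</txt>'
)

txt = ET.fromstring(xml)
infos = txt.find('infos')

# txt.text only covers text between <txt> and its first child (none here).
text_before = txt.text
# Text that follows </infos> is stored on the <infos> element, as its tail.
tail_text = infos.tail


def remove_keep_tail(parent, child):
    """Hypothetical helper: remove child but re-attach its tail text."""
    tail = child.tail or ''
    children = list(parent)
    idx = children.index(child)
    if idx == 0:
        # No previous sibling: the tail becomes part of the parent's text.
        parent.text = (parent.text or '') + tail
    else:
        # Otherwise append it to the previous sibling's tail.
        prev = children[idx - 1]
        prev.tail = (prev.tail or '') + tail
    parent.remove(child)


remove_keep_tail(txt, infos)
print(repr(txt.text))
```

If the <infos> nodes do not need to be removed at all, reading the .tail of <infos> directly inside the loop over tree.iter('txt') should also give the text.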