Cannot extract text from xml in python

Question

I have an xml file that comes from a doc (MS Word 2003, so I can't use docx library). I'm using lxml to parse it. I can get most of the text (everything is in <txt> nodes) but there are some nodes with the following structure:

<txt ptr="0x7f6354043000" id="3" symbol="8SwTxtFrm" next="4" upper="2" txtNodeIndex="9">
 <infos>
  <bounds left="1521" top="851" width="10517" height="322"/>
 </infos>
 The text I want to extract    <Special nLength="0" nType="POR_MARGIN" rText="" nWidth="2396"/>
 <Text nLength="1" nType="POR_TXT" nHeight="322" nWidth="78"/>
 <Text nLength="42" nType="POR_TXT" nHeight="322" nWidth="5647"/>
 <Special nLength="0" nType="POR_MARGIN" rText="" nWidth="2397"/>
 <LineBreak nWidth="10518"/>
 <Finish/>
</txt>

When I iterate over the <txt> nodes to extract the text with:

for txt in tree.iter('txt'):
    print(txt.text)

I don't get the text I want. I realized that the <infos> node causes the problem, so I tried to remove it:

for elt in tree.iter('txt'):
    for info in elt.findall('infos'):
        elt.remove(info)

But this removes the targeted text along with the <infos> node, even though the text is outside it.

Can someone help me understand why?

Thanks Murali, it works perfectly. I don't understand the syntax with the //, though. Is it like a *?
– bosonfute, Commented Mar 12, 2015 at 15:26

J Richard Snape · Accepted Answer · 2015-03-18 11:52:07Z

1

As per my comment on the original post, the OP solved the issue by changing the XPath as follows:

tree.xpath('//text()')
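
As for the // syntax asked about in the comment: in XPath, // is shorthand for /descendant-or-self::node()/, so //text() selects every text node at any depth, whereas * matches any element at a single level. A minimal sketch, assuming lxml is installed; the two-paragraph document below is a shortened, made-up stand-in for the real file:

```python
from lxml import etree

# Hypothetical, compacted stand-in for the real document
xml = (
    '<root>'
    '<txt id="3"><infos><bounds left="1521"/></infos>'
    'The text I want to extract<Finish/></txt>'
    '<txt id="4">Another paragraph<Finish/></txt>'
    '</root>'
)
root = etree.fromstring(xml)

# '//text()' starts at the document root and collects every text node,
# including text that follows a child element such as <infos>
all_text = root.xpath('//text()')

# './/text()' is relative: only text nodes below the current element
per_txt = [''.join(t.xpath('.//text()')) for t in root.iter('txt')]

print(all_text)
print(per_txt)
```

Note that //text() is an absolute path, so calling it on a single <txt> element still returns text from the whole document; the relative form .//text() is the one to use per element.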

edited Mar 18, 2015 at 11:52

J Richard Snape

answered Mar 18, 2015 at 8:53

Murali Mopuru

2 Comments

J Richard Snape Over a year ago

Oops - I see the chain of events here now - sorry it came to me in a review queue (hence stock comment above) but I can see you've done it as a comment first. I've edited the text to make it a little bolder so that the review queue shouldn't vote it low quality.

Michele d'Amico Over a year ago

Please explain a little why this should work and how it fixes the issue.

Mikhail Pyrev · Accepted Answer · 2015-03-19 06:45:54Z

0

You can extract text this way:

In [30]: from lxml import etree

In [31]: txt = """<txt ptr="0x7f6354043000" id="3" symbol="8SwTxtFrm" next="4" upper="2" txtNodeIndex="9">
   ....:  <infos>
   ....:   <bounds left="1521" top="851" width="10517" height="322"/>
   ....:  </infos>
   ....:  The text I want to extract    <Special nLength="0" nType="POR_MARGIN" rText="" nWidth="2396"/>
   ....:  <Text nLength="1" nType="POR_TXT" nHeight="322" nWidth="78"/>
   ....:  <Text nLength="42" nType="POR_TXT" nHeight="322" nWidth="5647"/>
   ....:  <Special nLength="0" nType="POR_MARGIN" rText="" nWidth="2397"/>
   ....:  <LineBreak nWidth="10518"/>
   ....:  <Finish/>
   ....: </txt>"""

In [32]: node = etree.fromstring(txt)

In [33]: ''.join(node.itertext())
Out[33]: '\n \n  \n \n The text I want to extract    \n \n \n \n \n \n'
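
The joined string above still carries the newlines and indentation from the file. One possible way to tidy it up, shown here as a sketch on a shortened copy of the node (the whitespace handling is my own suggestion, not something the question asked for):

```python
try:
    from lxml import etree
except ImportError:
    # The standard library offers the same calls used in this sketch
    import xml.etree.ElementTree as etree

# Shortened copy of the node from the question
node = etree.fromstring("""<txt id="3">
 <infos>
  <bounds left="1521"/>
 </infos>
 The text I want to extract    <Special nWidth="2396"/>
 <Finish/>
</txt>""")

raw = ''.join(node.itertext())   # all text, whitespace included
clean = ' '.join(raw.split())    # collapse whitespace runs, trim both ends

print(repr(clean))
```

str.split() with no argument splits on any run of whitespace, so this also collapses multiple spaces inside the text itself; skip that step if the original spacing matters.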

UPD:

The answer suggested by Murali actually returns a list, so you still need to join the strings. My solution is also a little faster:

In [13]: %timeit ''.join(node.itertext())
100000 loops, best of 3: 11.7 µs per loop

In [14]: %timeit ''.join(node.xpath('//text()'))
10000 loops, best of 3: 26.3 µs per loop

edited Mar 19, 2015 at 6:45

answered Mar 12, 2015 at 10:58

Mikhail Pyrev

3 Comments

bosonfute Over a year ago

Interesting solution. It works. I'll use Murali's solution, which seems slightly simpler, but I'll keep itertext() in mind as it may be very useful. Thanks.

smci Over a year ago

But what exactly was the problem? Is .iter() buggy and .itertext() not?

Mikhail Pyrev Over a year ago

@smci .iter() iterates over elements (only one <txt> in our case), while .itertext() iterates over all the text content inside an element. The problem here is txt.text: according to the lxml documentation, the text property holds only the text before the first subelement. If you want all the text, use .itertext().
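
This also explains why removing <infos> took the wanted text with it: text that follows a child element is stored as that child's .tail, and remove() drops the child together with its tail. A small sketch on a compacted, whitespace-free copy of the node (the variable names are mine), which saves the tail before removing the element:

```python
try:
    from lxml import etree
except ImportError:
    # The standard library offers the same calls used in this sketch
    import xml.etree.ElementTree as etree

# Compacted copy of the node from the question
node = etree.fromstring(
    '<txt><infos><bounds left="1521"/></infos>'
    'The text I want to extract<Finish/></txt>'
)
infos = node.find('infos')

text_before = node.text     # text before the first child: nothing here
tail_of_infos = infos.tail  # the wanted text hangs off <infos> as its tail

# Removing <infos> would take its tail along, so move the tail first
node.text = (node.text or '') + (tail_of_infos or '')
node.remove(infos)

restored = ''.join(node.itertext())
print(repr(text_before), repr(tail_of_infos), repr(restored))
```

If the only goal is to read the text, .itertext() or the XPath approach avoids modifying the tree at all.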
