I have an XML file that comes from a .doc file (MS Word 2003, so I can't use the docx library). I'm using lxml to parse it. I can get most of the text (everything is in <txt> nodes), but some nodes have the following structure:

<txt ptr="0x7f6354043000" id="3" symbol="8SwTxtFrm" next="4" upper="2" txtNodeIndex="9">
 <infos>
  <bounds left="1521" top="851" width="10517" height="322"/>
 </infos>
 The text I want to extract    <Special nLength="0" nType="POR_MARGIN" rText="" nWidth="2396"/>
 <Text nLength="1" nType="POR_TXT" nHeight="322" nWidth="78"/>
 <Text nLength="42" nType="POR_TXT" nHeight="322" nWidth="5647"/>
 <Special nLength="0" nType="POR_MARGIN" rText="" nWidth="2397"/>
 <LineBreak nWidth="10518"/>
 <Finish/>
</txt>

When I iterate over the <txt> elements to extract the text with:

for txt in tree.iter('txt'):
    print(txt.text)

I realized that the <infos> node is what causes the problem. I tried to remove it:

for elt in tree.iter('txt'):
    for info in elt.findall('infos'):
        elt.remove(info)

But this removes the target text along with the <infos> node, even though the text is outside it.

Can someone help me understand why?

3 Comments

  • Try tree.xpath('//text()') Commented Mar 12, 2015 at 11:54
  • Thanks Murali, it works perfectly. I don't understand the syntax with the //, though. Is it like a *? Commented Mar 12, 2015 at 15:26
  • Yes, it covers all of the document's elements. Commented Mar 18, 2015 at 8:53

2 Answers

As per my comment on the original post, the OP solved the issue by changing the XPath expression as follows:

tree.xpath('//text()')

2 Comments

Oops - I see the chain of events here now - sorry it came to me in a review queue (hence stock comment above) but I can see you've done it as a comment first. I've edited the text to make it a little bolder so that the review queue shouldn't vote it low quality.
Please explain a little why it should work and how it fixes the issue.

You can extract text this way:

In [30]: from lxml import etree

In [31]: txt = """<txt ptr="0x7f6354043000" id="3" symbol="8SwTxtFrm" next="4" upper="2" txtNodeIndex="9">
   ....:  <infos>
   ....:   <bounds left="1521" top="851" width="10517" height="322"/>
   ....:  </infos>
   ....:  The text I want to extract    <Special nLength="0" nType="POR_MARGIN" rText="" nWidth="2396"/>
   ....:  <Text nLength="1" nType="POR_TXT" nHeight="322" nWidth="78"/>
   ....:  <Text nLength="42" nType="POR_TXT" nHeight="322" nWidth="5647"/>
   ....:  <Special nLength="0" nType="POR_MARGIN" rText="" nWidth="2397"/>
   ....:  <LineBreak nWidth="10518"/>
   ....:  <Finish/>
   ....: </txt>"""

In [32]: node = etree.fromstring(txt)

In [33]: ''.join(node.itertext())
Out[33]: '\n \n  \n \n The text I want to extract    \n \n \n \n \n \n'
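
The whitespace-heavy output above hints at where the text actually lives. As a rough sketch (using a shortened copy of the node, falling back to the standard library's xml.etree.ElementTree if lxml is not installed, and with variable names that are purely illustrative), the wanted text is stored as the tail of <infos> rather than as the text of <txt>:

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: the stdlib module exposes the same text/tail/itertext API.
    import xml.etree.ElementTree as etree

# Shortened copy of the <txt> node from the question.
xml = (
    '<txt id="3">\n'
    ' <infos>\n'
    '  <bounds left="1521" top="851"/>\n'
    ' </infos>\n'
    ' The text I want to extract    <Special nLength="0"/>\n'
    ' <Finish/>\n'
    '</txt>'
)
node = etree.fromstring(xml)
infos = node.find('infos')

# .text only covers what comes before the first child element.
text_before_first_child = node.text

# Text that follows </infos> belongs to the .tail of <infos>.
text_after_infos = infos.tail

# itertext() walks every .text and .tail below the node, in document order.
all_text = ''.join(node.itertext()).strip()

print(repr(text_before_first_child))
print(repr(text_after_infos))
print(repr(all_text))
```

Stripping the joined result is one simple way to drop the indentation whitespace; if several text runs need to be kept apart, stripping each piece individually may be a better fit.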

UPD:

The answer suggested by Murali actually returns a list, so you still need to join the strings. My solution is also a little faster:

In [13]: %timeit ''.join(node.itertext())
100000 loops, best of 3: 11.7 µs per loop

In [14]: %timeit ''.join(node.xpath('//text()'))
10000 loops, best of 3: 26.3 µs per loop

3 Comments

Interesting solution. It works. I'll use Murali's solution, which seems slightly simpler, but I'll take note of itertext(), which may be very useful. Thanks.
But what exactly was the problem? Is .iter() buggy and .itertext() not?
@smci .iter() iterates over nodes (only one <txt> in our case), while .itertext() iterates over all of the node's text content. The problem here is txt.text. From the lxml docs on the text property: "Text before the first subelement." So if you want all of the text, you should use .itertext().
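
The same text/tail split likely explains why removing <infos> in the question also removed the text: text that follows a child's closing tag is stored on that child as its .tail, and remove() takes the tail away with the element. A minimal sketch of a workaround, assuming a shortened copy of the node and a hypothetical helper named remove_keeping_tail (not part of lxml), moves the tail to the previous sibling or the parent before removing:

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: the stdlib module behaves the same way for text/tail/remove.
    import xml.etree.ElementTree as etree

# Shortened copy of the <txt> node from the question.
xml = (
    '<txt id="3">\n'
    ' <infos>\n'
    '  <bounds left="1521" top="851"/>\n'
    ' </infos>\n'
    ' The text I want to extract    <Special nLength="0"/>\n'
    ' <Finish/>\n'
    '</txt>'
)


def remove_keeping_tail(parent, child):
    """Hypothetical helper: remove child but keep its tail text in the tree."""
    tail = child.tail or ''
    children = list(parent)
    index = children.index(child)
    if index == 0:
        # No previous sibling: the tail becomes part of the parent's text.
        parent.text = (parent.text or '') + tail
    else:
        # Otherwise append it to the previous sibling's tail.
        previous = children[index - 1]
        previous.tail = (previous.tail or '') + tail
    parent.remove(child)


# Plain remove(): the tail text disappears together with <infos>.
naive = etree.fromstring(xml)
naive.remove(naive.find('infos'))
lost = ''.join(naive.itertext()).strip()

# Tail-preserving removal: the text survives.
node = etree.fromstring(xml)
for info in node.findall('infos'):
    remove_keeping_tail(node, info)
remaining = ''.join(node.itertext()).strip()

print(repr(lost))
print(repr(remaining))
```

If only the text is needed, itertext() or the XPath query avoids modifying the tree at all, so a helper like this is mainly useful when the cleaned tree has to be kept.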
