I have an XML document from which I want to extract text based on tags.
The part I want to extract text from looks something like this:

<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="­"/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>

When I do

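import xml.etree.ElementTree as ET

tree = ET.parse("myfile.xml")
root = tree.getroot()
# Collect the distinct tag names, then pick the one containing "BlockText"
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    texte = text.text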

I can only get the part that comes before the empty tag <TIP CONTENT="­"/>.
I tried to delete this tag before getting the rest of the text:

emptyTag = list(filter(lambda i: "TIP" in i, tags))
for e in root.iter(emptyTag) :
    root.remove(e)

But this does not work.
Neither <BlockText> nor <TIP> is a direct child of root.


Thank you.

  • Use itertext(): stackoverflow.com/q/19369901/407651 Commented Feb 20, 2020 at 16:45
  • Thank you! That's exactly what I needed! Commented Feb 20, 2020 at 17:05

3 Answers

OK, this is what ended up working for me:

emptyTags = list(filter(lambda i: "TIP" in i, tags))
if emptyTags:
    emptyTag = emptyTags[0]
    # The text after each <TIP/> is stored in that element's tail
    for element in root.iter(emptyTag):
        print(element.tail)

But I still couldn't get the text as one block in its original order. I could get the text of every BlockText element and the tail of every TIP element, but not the two interleaved.

Update:
I used:

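import xml.etree.ElementTree as ET

tree = ET.parse("myfile.xml")
root = tree.getroot()
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    # itertext() yields the element's text and every descendant's text and tail, in document order
    texte = ''.join(text.itertext())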
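
For anyone who wants to try this without the original file, here is a minimal, self-contained sketch of the same itertext() approach. The XML string is a made-up, well-formed stand-in for the fragment in the question, and it assumes no namespace, so the tag name can be passed to iter() directly. The names xml_source and collected are only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for myfile.xml: BlockText is nested below the root
# and contains an empty <TIP/> element in the middle of its text.
xml_source = """<doc>
  <BlockText ID="Bhf76" lang="en">It has survived not only<TIP CONTENT="x"/> five centuries.</BlockText>
</doc>"""

root = ET.fromstring(xml_source)

collected = []
for block in root.iter("BlockText"):
    # Join the element's own text with the text and tail of every descendant.
    collected.append("".join(block.itertext()))

for full_text in collected:
    print(full_text)
```

Because itertext() walks the subtree in document order, the text before and after the empty tag comes out in the right sequence without removing anything from the tree.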

The text after <TIP CONTENT="­"/> belongs to the TIP element's tail, not to the text of the BlockText element.

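elem.text is the text between an element's opening tag and its first child (or its closing tag, if it has no children). elem.tail is the text between the element's closing tag and the next tag. The tail is usually just whitespace, but in this case it holds actual text.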
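
A small sketch may make the text/tail split easier to see. The XML string below is a hypothetical, simplified version of the fragment in the question, and the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the fragment in the question.
xml_source = '<BlockText ID="Bhf76">before the tag<TIP CONTENT="x"/> after the tag</BlockText>'

block = ET.fromstring(xml_source)
tip = block.find("TIP")

print(repr(block.text))  # text between <BlockText> and its first child
print(repr(tip.text))    # the empty element has no text of its own
print(repr(tip.tail))    # text between the end of <TIP/> and the next tag
```

Here block.text only covers "before the tag", while " after the tag" is stored on the TIP element as its tail, which is why reading .text alone stops at the empty tag.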

2 Comments

Is there a way to get the text as a whole block? I can get each tag, but that messes up the order of the sentences.
From an XML point of view, the text you want belongs to two different elements. Before deleting the child element, grab its tail text and append it to the parent's text field.
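
The suggestion in the last comment could be sketched roughly as follows. This is only one possible way to do it: remove_keeping_tail is a hypothetical helper, not part of ElementTree, and the XML string is a made-up stand-in for the real file. One detail the sketch adds as an assumption about what is wanted: if the removed element is not the first child, its tail is appended to the previous sibling's tail rather than to the parent's text, so the order of the text is preserved. Calling remove() on each element's actual parent also avoids the problem in the question, where root.remove() fails because TIP is not a direct child of root.

```python
import xml.etree.ElementTree as ET

def remove_keeping_tail(root, tag):
    """Remove every `tag` element below root, keeping the text that followed it."""
    # Snapshot the elements first so the tree is not modified while iterating it.
    for parent in list(root.iter()):
        previous = None  # last sibling that was kept
        for child in list(parent):
            if child.tag != tag:
                previous = child
                continue
            tail = child.tail or ""
            if previous is None:
                # First child: its tail continues the parent's text.
                parent.text = (parent.text or "") + tail
            else:
                # Otherwise the tail continues the previous sibling's tail.
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)

# Hypothetical stand-in document: TIP is not a direct child of the root.
root = ET.fromstring(
    '<doc><BlockText>not only<TIP CONTENT="x"/> five centuries</BlockText></doc>'
)
remove_keeping_tail(root, "TIP")
block = root.find("BlockText")
print(block.text)
```

If the tree does not need to be modified, joining itertext() on the BlockText element gives the same string with less code.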
Another solution for reference only

from simplified_scrapy import SimplifiedDoc
html = '''
<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="­"/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>
'''
doc = SimplifiedDoc(html)
print(doc.select('BlockText'))
print(doc.select('BlockText>text()'))
print(doc.selects('BlockText>text()'))

Result:

{'tag': 'BlockText', 'attr1': 'blah', 'attr2': '657', 'ID': 'Bhf76', 'lang': 'en', 'html': '\nSimply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="\xad" />\n five centuries, electronic typesetting, remaining essentially release.\n'}
Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.
['Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.']
