python xml.etree.ElementTree remove empty tag in the middle of text

Question

I have an XML document from which I want to extract text based on tags.
The part that I want to extract text from looks something like this:

<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT=""/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>

When I do

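import xml.etree.ElementTree as ET

tree = ET.parse("myfile.xml")
root = tree.getroot()
# Collect the distinct tag names, then pick the one containing "BlockText"
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    texte = text.text  # only the text before the first child element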

I'm only able to grab the part that comes before the empty tag <TIP CONTENT=""/>.
I tried to delete this tag before getting the rest of the text.
I did:

emptyTag = list(filter(lambda i: "TIP" in i, tags))
for e in root.iter(emptyTag) :
    root.remove(e)

But this is not working.
Neither <BlockText> nor <TIP> is a direct child of root.

Thank you.

Use itertext(): stackoverflow.com/q/19369901/407651

– mzjn, Commented Feb 20, 2020 at 16:45
Thank you! That's exactly what I needed!

– MMM, Commented Feb 20, 2020 at 17:05

MMM · Accepted Answer · 2020-02-20 17:08:39Z

1

OK, this is what ended up working for me:

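# `tags` and `root` are the variables defined in the question
emptyTags = list(filter(lambda i: "TIP" in i, tags))
if emptyTags:
    emptyTag = emptyTags[0]
    # Iterate only when a TIP tag exists, so emptyTag is always defined
    for element in root.iter(emptyTag):
        print(element.tail)  # the text that follows each <TIP/> element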

But with this I still couldn't get the text as a whole block, in the original order. I could get the text of all the BlockText tags and the tails of all the TIP tags, but not together.

Update:
I used:

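import xml.etree.ElementTree as ET

tree = ET.parse("myfile.xml")
root = tree.getroot()
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    # itertext() yields the element's text, then each descendant's text and tail, in document order
    texte = ''.join(text.itertext())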
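
For readers who want to try this without a file, here is a minimal self-contained sketch of the same idea. The sample XML is hypothetical: it is modeled on the snippet in the question, wrapped in made-up <Doc> and <Page> elements, with no namespace, so the tag can be matched by its plain name. The whitespace normalization at the end is an optional extra, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: BlockText is nested, not a direct child of the root.
xml = """<Doc><Page>
<BlockText ID="Bhf76" lang="en">
It has survived not only<TIP CONTENT=""/>
 five centuries.
</BlockText>
</Page></Doc>"""

root = ET.fromstring(xml)

blocks = []
for block in root.iter("BlockText"):
    # itertext() walks text and tails in document order, so the sentence
    # stays intact across the empty <TIP/> element.
    raw = "".join(block.itertext())
    # Collapse runs of whitespace (including newlines) into single spaces.
    blocks.append(" ".join(raw.split()))

print(blocks)  # ['It has survived not only five centuries.']
```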

edited Feb 20, 2020 at 17:08

answered Feb 20, 2020 at 15:18

egur · Accepted Answer · 2020-02-20 14:34:55Z

0

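The text after <TIP CONTENT=""/> belongs to that element's tail, not to the text of the BlockText element.

elem.text is the text that follows the element's opening tag, up to its first child. elem.tail is the text that follows the element's closing tag, up to the next sibling or the parent's closing tag. The tail is usually just whitespace, but in this case it holds actual text.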
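
A small sketch can make the text/tail split visible. The XML string below is a hypothetical, shortened version of the question's snippet (with attr2 quoted so that it parses); it is only meant to illustrate where ElementTree stores each piece of text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modeled on the question's snippet.
xml = '<BlockText attr2="657">It has survived not only<TIP CONTENT=""/> five centuries.</BlockText>'
block = ET.fromstring(xml)
tip = block.find("TIP")

print(repr(block.text))  # 'It has survived not only' (text before the first child)
print(repr(tip.text))    # None (the empty element has no text of its own)
print(repr(tip.tail))    # ' five centuries.' (text that follows the TIP element)
print(repr(block.tail))  # None (nothing follows the root element)
```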

answered Feb 20, 2020 at 14:34


2 Comments

MMM Over a year ago

Is there a way to get the text as a whole block? I can get each tag, but that messes up the order of the sentences.

egur Over a year ago

From an XML point of view, the text you want belongs to two different elements. Before deleting the child element, grab its tail text and append it to the parent's text.
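
The suggestion above could be sketched as follows. This is an illustration under assumptions, not code from the thread: remove_keep_tail is a hypothetical helper name, and the sample XML is made up. Because ElementTree elements do not store a reference to their parent, the sketch first builds a child-to-parent map; that is also why calling root.remove(e) fails when e is not a direct child of root. If the removed element has a preceding sibling, its tail is appended to that sibling's tail rather than to the parent's text, so the order is preserved.

```python
import xml.etree.ElementTree as ET

def remove_keep_tail(root, tag):
    """Remove every `tag` element below root, keeping the text that followed it."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for elem in list(root.iter(tag)):
        parent = parent_of.get(elem)
        if parent is None:
            continue  # the root itself matched; it cannot be removed
        siblings = list(parent)
        idx = siblings.index(elem)
        tail = elem.tail or ""
        if idx == 0:
            # No preceding sibling: the tail continues the parent's own text.
            parent.text = (parent.text or "") + tail
        else:
            # Otherwise it continues the previous sibling's tail.
            prev = siblings[idx - 1]
            prev.tail = (prev.tail or "") + tail
        parent.remove(elem)

# Hypothetical sample; BlockText and TIP are not direct children of the root's parent chain alone.
root = ET.fromstring(
    '<Doc><BlockText>not only<TIP CONTENT=""/> five centuries.</BlockText></Doc>'
)
remove_keep_tail(root, "TIP")
block = root.find("BlockText")
print(block.text)  # not only five centuries.
```

Any text or children inside a removed element are discarded, which is fine for empty tags like <TIP CONTENT=""/>. When the goal is only to read the text, itertext() is simpler because it leaves the tree untouched.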

dabingsou · Accepted Answer · 2020-02-26 00:16:53Z

Another solution, for reference only, using the simplified_scrapy library:

from simplified_scrapy import SimplifiedDoc
html = '''
<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT=""/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>
'''
doc = SimplifiedDoc(html)
print (doc.select('BlockText'))
print (doc.select('BlockText>text()'))
print (doc.selects('BlockText>text()'))

Result:

{'tag': 'BlockText', 'attr1': 'blah', 'attr2': '657', 'ID': 'Bhf76', 'lang': 'en', 'html': '\nSimply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="\xad" />\n five centuries, electronic typesetting, remaining essentially release.\n'}
Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.
['Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.']
