How to parse HTML tags as raw text using ElementTree

Question

I have a file that has HTML within XML tags and I want that HTML as raw text, rather than have it be parsed as children of the XML tag. Here's an example:

import xml.etree.ElementTree as ET
root = ET.fromstring("<root><text><p>This is some text that I want to read</p></text></root>")

If I try:

root.find('text').text

it produces no output, but root.find('text/p').text returns the paragraph text without the tags. I want everything inside the text tag as raw text, but I can't figure out how to get it.

Community · Accepted Answer · 2017-05-23 12:29:26Z

Your solution is reasonable. An element object behaves like a list of its child elements. Its .text attribute holds only the text that appears before the first child, so it never includes anything that belongs to nested elements.
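To make the point about .text concrete, here is a minimal sketch. The sample document with leading and trailing text is made up for illustration and is not the asker's file; it only shows where ElementTree stores each piece of text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: text before and after a nested <p> element.
root = ET.fromstring("<root><text>lead <p>inner</p> trailing</text></root>")
node = root.find('text')
p = node[0]                  # an element can be indexed like a list of children

print(repr(node.text))       # text before the first child
print(len(node))             # number of child elements
print(repr(p.text))          # text inside <p>
print(repr(p.tail))          # text that follows </p>, stored on the child

# In the document from the question nothing precedes <p>, so .text is None.
question_root = ET.fromstring(
    "<root><text><p>This is some text that I want to read</p></text></root>")
print(question_root.find('text').text)
```

The text after a closing tag lives in the child's .tail attribute rather than in the parent's .text, which is why the parent's .text alone never gives the full content.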

Your code can be improved, though. Repeated string concatenation in a loop can be expensive in Python, because each += may copy the string built so far. It is better to collect the substrings in a list and join them at the end, like this:

output_lst = []
for child in root.find('text'):
    output_lst.append(ET.tostring(child, encoding="unicode"))

output_text = ''.join(output_lst)

The list can also be built with a list comprehension, so the code becomes:

output_lst = [ET.tostring(child, encoding="unicode") for child in root.find('text')]
output_text = ''.join(output_lst)

The .join method accepts any iterable that produces strings, so the list does not need to be built in advance. Instead, you can pass a generator expression (the same expression that appears inside the [] of the list comprehension) directly:

output_text = ''.join(ET.tostring(child, encoding="unicode") for child in root.find('text'))

The one-liner can be split across several lines to make it more readable:

output_text = ''.join(ET.tostring(child, encoding="unicode")
                      for child in root.find('text'))
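Putting the pieces together, here is a self-contained sketch of the join-based approach. The helper name children_as_text and the two-paragraph sample are assumptions added for illustration; note that encoding="unicode" makes ET.tostring return a str in Python 3.

```python
import xml.etree.ElementTree as ET

def children_as_text(element):
    """Serialize every child of `element` and concatenate the results."""
    return ''.join(ET.tostring(child, encoding="unicode")
                   for child in element)

# The document from the question.
root = ET.fromstring(
    "<root><text><p>This is some text that I want to read</p></text></root>")
output_text = children_as_text(root.find('text'))
print(output_text)   # <p>This is some text that I want to read</p>

# Hypothetical sample with two children, to show the concatenation.
two = ET.fromstring("<root><text><p>one</p><p>two</p></text></root>")
print(children_as_text(two.find('text')))   # <p>one</p><p>two</p>
```

If root.find('text') might return None (no such tag), check for that before calling the helper.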

seitzej · Accepted Answer · 2014-06-24 18:59:48Z

I was able to get what I wanted by appending all child elements of my text tag to a string using ET.tostring:

output_text = ""
for child in root.find('text'):
    output_text += ET.tostring(child, encoding="unicode")

>>> output_text
'<p>This is some text that I want to read</p>'

2 Comments

alecxe Over a year ago

Yup, what about the answer I've provided? Doesn't it look simpler?

seitzej Over a year ago

Sorry, I guess I wasn't that clear in my initial request. I want to have the "<p>" tags (or any other HTML tags) in the output_text string rather than only the tags' inner text.

soysal · Accepted Answer · 2022-11-02 12:33:29Z

The solutions above will miss the initial part of your HTML if the content begins with text, e.g.:

<root><text>This is <i>some text</i> that I want to read</text></root>

You can handle that case like this:

node = root.find('text')
output_list = [node.text] if node.text else []
output_list += [ET.tostring(child, encoding="unicode") for child in node]
output_text = ''.join(output_list)
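The same idea can be wrapped in a reusable function. This is a sketch under a couple of assumptions: the name inner_xml is made up, and the escape() call is an addition not present in the answer above. It is there because ElementTree stores .text with entities already decoded (for example &amp;amp; becomes &), so the leading text is re-escaped to keep the output valid markup. The text after each child needs no separate handling, since ET.tostring also emits the child's tail.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def inner_xml(element):
    """Return the markup inside `element`, including leading text."""
    parts = []
    if element.text:
        # .text is stored unescaped, so re-escape &, < and > for output.
        parts.append(escape(element.text))
    # tostring() on a child also includes the text that follows it (.tail).
    parts.extend(ET.tostring(child, encoding="unicode") for child in element)
    return ''.join(parts)

root = ET.fromstring(
    "<root><text>This is <i>some text</i> that I want to read</text></root>")
print(inner_xml(root.find('text')))
# This is <i>some text</i> that I want to read

# Hypothetical sample with an entity in the leading text.
snack = ET.fromstring("<text>Fish &amp; <b>chips</b></text>")
print(inner_xml(snack))
# Fish &amp; <b>chips</b>
```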
