How do I get the full XML or HTML content of an element using ElementTree?

Question

That is, all text and subtags, without the tag of an element itself?

Having

<p>blah <b>bleh</b> blih</p>

I want

blah <b>bleh</b> blih

element.text returns "blah " and etree.tostring(element) returns:

<p>blah <b>bleh</b> blih</p>

S.Lott · Accepted Answer · 2008-12-19 20:48:37Z

11

ElementTree works perfectly; you just have to assemble the answer yourself. Something like this...

"".join( [ t.text or "" ] + [ xml.tostring(e, encoding="unicode") for e in t ] )

Thanks to JV and PEZ for pointing out the errors.

Edit.

>>> import xml.etree.ElementTree as xml
>>> s= '<p>blah <b>bleh</b> blih</p>\n'
>>> t=xml.fromstring(s)
>>> "".join( [ t.text ] + [ xml.tostring(e, encoding="unicode") for e in t ] )
'blah <b>bleh</b> blih'
>>>

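The tail is not needed: tostring() already serializes each child's tail, and the element's own tail lies outside the element.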
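
For Python 3, the same idea can be sketched as a small helper. This is a minimal sketch, not the answer's original code: the name inner_xml is made up for illustration, and encoding="unicode" is used so that tostring() returns str rather than bytes.

```python
import xml.etree.ElementTree as ET

def inner_xml(element):
    """Return the element's text and serialized children, without its own tag."""
    # text is None when the element starts directly with a child element.
    parts = [element.text or ""]
    # tostring() serializes each child together with its tail,
    # so the text following </b> is picked up here.
    parts.extend(ET.tostring(child, encoding="unicode") for child in element)
    return "".join(parts)

print(inner_xml(ET.fromstring("<p>blah <b>bleh</b> blih</p>")))  # blah <b>bleh</b> blih
print(inner_xml(ET.fromstring("<p><b>bleh</b></p>")))            # <b>bleh</b>
```

Iterating over the element directly replaces getchildren(), which is no longer available in recent Python versions.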

edited Dec 19, 2008 at 20:48

answered Dec 19, 2008 at 11:21

S.Lott

3 Comments

JV. Over a year ago

Just pointing out a typo - method name - "finall" which I think should have been "findall". Even if findall is used it results in this pastebin.com/f6de9a841. Please revise your answer.

Pablo Fernandez Over a year ago

I'm doing something similar to that, but with a for loop. You are actually missing the tail.

S.Lott Over a year ago

The tail is the extra whitespace after the closing tag of the construct.

Pablo Fernandez · Accepted Answer · 2008-12-20 17:01:14Z

9

This is the solution I ended up using:

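import xml.etree.ElementTree as etree

def element_to_string(element):
    # Leading text, or "" when the element has none.
    s = element.text or ""
    for sub_element in element:
        # encoding="unicode" makes tostring() return str on Python 3.
        s += etree.tostring(sub_element, encoding="unicode")
    # tail is None when no text follows the closing tag.
    s += element.tail or ""
    return s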

edited Dec 20, 2008 at 17:01

answered Dec 19, 2008 at 17:27

Pablo Fernandez

5 Comments

PEZ Over a year ago

That would fail when there's no text or no tail, wouldn't it?

Pablo Fernandez Over a year ago

PEZ, yes, it fails when there's no text, just found it by running my code and fixed it. I have many instances of no tail and that doesn't fail. Not sure why.

cdleary Over a year ago

Just a nitpick: += on strings is less performant. It's best to accumulate a list of strings and ''.join it at the end.

dbader Over a year ago

You may want to recurse and call element_to_string on the sub-element again to capture all of the text, i.e. for sub_element in element: s += element_to_string(sub_element)

user18309290 Over a year ago

For Python 3: ET.tostring(sub_element, encoding='unicode').
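
A short sketch of what that comment refers to, assuming the standard-library xml.etree.ElementTree imported as ET: on Python 3, tostring() returns bytes by default, which cannot be concatenated with str, while encoding='unicode' returns str.

```python
import xml.etree.ElementTree as ET

# First child of <p>, i.e. the <b> element together with its tail.
b = ET.fromstring("<p>blah <b>bleh</b> blih</p>")[0]

as_bytes = ET.tostring(b)                     # default encoding: returns bytes
as_text = ET.tostring(b, encoding="unicode")  # returns str

print(type(as_bytes).__name__, as_bytes)  # bytes b'<b>bleh</b> blih'
print(type(as_text).__name__, as_text)    # str <b>bleh</b> blih
```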

mike rodent · Accepted Answer · 2021-09-12 19:54:20Z

These are good answers, which answer the OP's question, particularly if the question is confined to HTML. But documents are inherently messy, and the depth of element nesting is usually impossible to predict.

To simulate DOM's getTextContent() you would have to use a (very) simple recursive mechanism.

To get just the bare text:

def get_deep_text( element ):
    text = element.text or ''
    for subelement in element:
        text += get_deep_text( subelement )
    text += element.tail or ''
    return text
print( get_deep_text( element_of_interest ))
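
As a possible alternative to hand-written recursion, the standard library's Element.itertext() walks the same text in document order. The sketch below uses an invented sample document. One difference worth noting: itertext() does not include the tail of the element it is called on, whereas the function above appends element.tail at every level, including the starting element.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<div><p>blah <b>bleh</b> blih</p> after</div>")
p = root[0]

# Text of <p> and everything nested in it, including the tail of <b>.
deep_text = "".join(p.itertext())
print(deep_text)     # blah bleh blih

# The text after </p> belongs to p.tail and is not part of itertext().
print(repr(p.tail))  # ' after'
```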

To get all the details about the boundaries between raw text:

class holder: pass # this is just a way of creating a holder object
holder.element_count = 0
def get_deep_text_w_boundaries(element, depth = 0):
    holder.element_count += 1
    element_no = holder.element_count 
    indent = depth * '  '
    text1 = f'{indent}(el {element_no} tag {element.tag}: text |{element.text or ""}| - attribs: {element.attrib})' 
    print(text1)
    for subelement in element:
        get_deep_text_w_boundaries(subelement, depth + 1)
    text2 = f'{indent}(el {element_no} tag {element.tag} - tail: |{element.tail or ""}|)' 
    print(text2)
get_deep_text_w_boundaries(etree_element)

Example output:

(el 1 tag source: text |DEVANT LE | - attribs: {})
  (el 2 tag g: text |TRIBUNAL JUDICIAIRE| - attribs: {'style_no': '3'})
  (el 2 tag g - tail: ||)
(el 1 tag source - tail: | DE VERSAILLES|)

PEZ · Accepted Answer · 2008-12-19 11:56:30Z

2

I doubt ElementTree is the thing to use for this. But assuming you have strong reasons for using it maybe you could try stripping the root tag from the fragment:

re.sub(r'(^<%s\b.*?>|</%s\b.*?>$)' % (element.tag, element.tag), '', ElementTree.tostring(element, encoding='unicode'))
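
A self-contained sketch of this regex idea for Python 3. The helper name strip_outer_tag is made up for illustration, and the pattern is a slightly tightened variant rather than the exact one above. It assumes the element has no tail and no namespace prefix, since either would keep the pattern from matching the serialized tags.

```python
import re
import xml.etree.ElementTree as ET

def strip_outer_tag(element):
    """Serialize the element, then strip its own start and end tags."""
    serialized = ET.tostring(element, encoding="unicode")
    # Escape the tag name in case it contains regex metacharacters.
    tag = re.escape(element.tag)
    # Start tag anchored at the beginning, end tag anchored at the end.
    pattern = r"^<%s\b[^>]*>|</%s\s*>$" % (tag, tag)
    return re.sub(pattern, "", serialized)

p = ET.fromstring('<p class="x">blah <b>bleh</b> blih</p>')
print(strip_outer_tag(p))  # blah <b>bleh</b> blih
```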

answered Dec 19, 2008 at 11:56

PEZ

RayLuo · Accepted Answer · 2018-02-21 01:32:18Z

2

Most of the answers here are based on the XML parser ElementTree; even PEZ's regex-based answer still partially relies on ElementTree.

All of those are good and suitable for most use cases. For the sake of completeness, though, it is worth noting that ElementTree.tostring(...) gives you an equivalent snippet that is not always identical to the original payload. If, for some rare reason, you want to extract the content as-is, you have to use a purely regex-based solution. This example shows how I use a regex-based solution.
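
To illustrate the point about tostring() output being equivalent but not identical, here is a small round-trip sketch with an invented input string. The standard-library serializer normalizes attribute quoting and collapses empty elements, so the original bytes are not preserved.

```python
import xml.etree.ElementTree as ET

original = "<p a='1'>x<br></br></p>"
round_tripped = ET.tostring(ET.fromstring(original), encoding="unicode")

print(original)       # <p a='1'>x<br></br></p>
print(round_tripped)  # <p a="1">x<br /></p>
```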

answered Feb 21, 2018 at 1:32

RayLuo

Arup · Accepted Answer · 2020-07-21 00:06:21Z

0

This answer is a slightly modified version of Pupeno's reply. I added the encoding argument to "tostring". This issue cost me many hours; I hope this small correction helps others.

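from xml.etree import ElementTree

def element_to_string(element):
    s = element.text or ""
    for sub_element in element:
        # encoding='unicode' makes tostring() return str instead of bytes
        s += ElementTree.tostring(sub_element, encoding='unicode')
    # tail is None when no text follows the closing tag
    s += element.tail or ""
    return s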

answered Jul 21, 2020 at 0:06

Arup

Till · Accepted Answer · 2008-12-19 11:23:59Z

-4

No idea if an external library might be an option, but anyway -- assuming there is one <p> with this text on the page, a jQuery-solution would be:

alert($('p').html()); // returns blah <b>bleh</b> blih

answered Dec 19, 2008 at 11:23

Till
