lxml xml parsing with html tags inside xml tags

Question

<xml>
<maintag>    
<content> lorem <br>ipsum</br> <strong> dolor sit </strong> and so on </content>
</maintag>
</xml>

The XML file that I regularly parse may have HTML tags inside the content tag, as shown above.

Here is how I parse the file:

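from io import StringIO
from lxml import etree

parser = etree.XMLParser(remove_blank_text=False)
tree = etree.parse(StringIO(xmlFile), parser)
for item in tree.iter('maintag'):
    # .text returns only the text before the first child element
    my_content = item.find('content').text
    # print(my_content)
    # output: lorem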

As a result, my_content is 'lorem' instead of what I would like to see: ' lorem ipsum dolor sit and so on'.

How can I read the content as ' lorem ipsum dolor sit and so on'?

Note: the content tag may contain other HTML tags instead of strong, or none at all.

The HTML tags are just XML tags with names identical to HTML tags. That doesn't make them HTML. In HTML, the <br> tag is empty, for example.
– Martijn Pieters, Commented Nov 7, 2013 at 12:20
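
The result described in the question follows from how ElementTree-style APIs store mixed content: an element's .text holds only the characters before its first child, and the text that follows each child is stored on that child as .tail. A minimal sketch of this, using the sample content element from the question (the stdlib fallback import is an assumption added here so the snippet also runs without lxml installed):

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module is used as a fallback; it follows the same .text/.tail model
    import xml.etree.ElementTree as etree

content = etree.fromstring(
    "<content> lorem <br>ipsum</br> <strong> dolor sit </strong> and so on </content>"
)

# .text holds only the characters before the first child element
leading = content.text

# Each child carries its own .text (inside the tag) and .tail (after its end tag)
pieces = [(child.tag, child.text, child.tail) for child in content]

print(repr(leading))
print(pieces)
```

Here leading is ' lorem ', and the remaining words are spread across the .text and .tail of the br and strong children, which is why reading .text alone drops them.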

falsetru · Accepted Answer · 2013-11-07 12:24:49Z


from lxml import etree
root = etree.fromstring('''<xml>
<maintag>    
<content> lorem <br>ipsum</br> <strong> dolor sit </strong> and so on </content>
</maintag>
</xml>''')
for content in root.xpath('.//maintag/content'):
    # serialize the whole element, tags included, as a str rather than bytes
    print(etree.tostring(content, encoding='unicode'))

prints

<content> lorem <br>ipsum</br> <strong> dolor sit </strong> and so on </content>

....
for content in root.xpath('.//maintag/content'):
    # with_tail=False: text() already selects each child's tail text as its own node
    print(''.join(child if isinstance(child, str)
                  else etree.tostring(child, encoding='unicode', with_tail=False)
                  for child in content.xpath('*|text()')))

prints

 lorem <br>ipsum</br> <strong> dolor sit </strong> and so on 
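
If the goal is the text alone, with the tags dropped as the question asks, one option is itertext(), which walks the element and yields every text and tail fragment inside it. The following is a minimal sketch rather than part of the original answer; plain_text is a hypothetical helper name, the whitespace normalization is an assumption about the desired output, and the stdlib fallback import is only there so the snippet runs without lxml installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib fallback; itertext() behaves the same way for this input
    import xml.etree.ElementTree as etree

root = etree.fromstring("""<xml>
<maintag>
<content> lorem <br>ipsum</br> <strong> dolor sit </strong> and so on </content>
</maintag>
</xml>""")


def plain_text(element):
    """Concatenate all text inside element, dropping the tags themselves."""
    return "".join(element.itertext())


for content in root.iter("content"):
    raw = plain_text(content)
    # Collapse runs of whitespace left behind where tags were removed
    normalized = " ".join(raw.split())
    print(normalized)  # lorem ipsum dolor sit and so on
```

This works whichever child tags appear inside content, and also when there are none, since itertext() then yields only the element's own text.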

