<xml>
<maintag>    
<content> lorem <br>ipsum</br> <strong> dolor sit </strong> and so on </content>
</maintag>
</xml>

The XML file that I regularly parse may have HTML tags inside the content tag, as shown above.

Here is how I parse the file:

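from io import StringIO
from lxml import etree

# xmlFile holds the XML document shown above as a string
parser = etree.XMLParser(remove_blank_text=False)
tree = etree.parse(StringIO(xmlFile), parser)
for item in tree.iter('maintag'):
    my_content = item.find('content').text
    # print(my_content)
    # output: lorem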

As a result, my_content is 'lorem' instead of what I'd like to see: ' lorem <br>ipsum</br> <strong> dolor sit </strong> and so on'

How can I read the content as ' lorem <br>ipsum</br> <strong> dolor sit </strong> and so on'?

Note: the content tag may contain other HTML tags instead of strong, or none at all.

    The HTML tags are just XML tags whose names happen to match HTML tags. That doesn't make them HTML. In HTML, for example, the <br /> tag is empty. Commented Nov 7, 2013 at 12:20

1 Answer

from lxml import etree
root = etree.fromstring('''<xml>
<maintag>    
<content> lorem <br>ipsum</br> <strong> dolor sit </strong> and so on </content>
</maintag>
</xml>''')
for content in root.xpath('.//maintag/content'):
    print(etree.tostring(content, encoding='unicode'))

prints

<content> lorem <br>ipsum</br> <strong> dolor sit </strong> and so on </content>
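
As background for why .text alone returns only the leading text: lxml stores the text before an element's first child in that element's .text, and the text that follows each child in the child's .tail. The short sketch below simply inspects those attributes on the sample content element; the variable name content is just for illustration.

```python
from lxml import etree

content = etree.fromstring(
    '<content> lorem <br>ipsum</br> <strong> dolor sit </strong> and so on </content>'
)

# Text before the first child tag lives on the parent element.
print(repr(content.text))

# Text inside each child is its .text; text after its closing tag is its .tail.
for child in content:
    print(child.tag, repr(child.text), repr(child.tail))
```

So the full content has to be assembled from content.text plus each child's serialized form and tail.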

....
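for content in root.xpath('.//maintag/content'):
    # text() already yields each child's tail text, so serialize children
    # with with_tail=False to avoid emitting the tail twice.
    print(''.join(child if isinstance(child, str)
                  else etree.tostring(child, encoding='unicode', with_tail=False)
                  for child in content.xpath('*|text()')))

prints

 lorem <br>ipsum</br> <strong> dolor sit </strong> and so on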