In Python, Parsing Custom XML Tags Without Parsing HTML

Question

I'm new to Python 2.7, and I'm trying to parse an XML file that contains HTML. I want to parse the custom XML tags without parsing any HTML content whatsoever. What's the best way to do this? (If it's helpful, my list of custom XML tags is small, so if there's an XML parser that has an option to only parse specified tags that would probably work fine.)

E.g. I have an XML file that looks like

<myTag1 myAttrib="value">
  <myTag2>
    <p>My what a lovely day.</p>
  </myTag2>
</myTag1>

I'd like to be able to parse apart everything except the HTML, and in particular to extract the value of myTag2 as un-parsed HTML.

EDIT: Here's more information to answer a question below. I had previously tried using ElementTree. This is what happened:

import xml.etree.ElementTree as ET

root = ET.fromstring(xmlstring)  # xmlstring holds the XML shown above
root.tag  # returns 'myTag1'
root[0].tag  # returns 'myTag2'
root[0].text  # returns None, but I want it to return the HTML string

The HTML string I want has been parsed and is stored as a tag and text:

root[0][0].tag  # returns 'p', but I don't even want root[0][0] to exist
root[0][0].text  # returns 'My ... day.'

But really I'd like to be able to do something like this...

root[0].unparsedtext  # returns '<p>My ... day.</p>'

SOLUTION:

har07's answer works great. I modified that code slightly to account for an edge case. Here's what I'm implementing:

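def _getInner(element):
    # element.text is None when the element starts directly with a child tag
    textStr = element.text or ''
    # ET.tostring(e) includes e's tail text; on Python 3 it returns bytes
    # unless encoding='unicode' is passed
    return textStr + ''.join(ET.tostring(e) for e in element)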

Then if

element = ET.fromstring('<myTag>Let us be <b>gratuitous</b> with tags</myTag>')

the original code returns only the content from the first child tag onward and drops the leading text, while the modified version captures everything:

''.join(ET.tostring(e) for e in element)  # returns '<b>gratuitous</b> with tags'

_getInner(element)  # returns 'Let us be <b>gratuitous</b> with tags'
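
For readers on Python 3, the same idea can be sketched as below. This is a minimal variant rather than the code from the answer: the name get_inner is made up for illustration, and encoding="unicode" is used because ET.tostring() returns bytes by default on Python 3.

```python
import xml.etree.ElementTree as ET


def get_inner(element):
    """Return the element's leading text plus its serialized children."""
    # encoding="unicode" makes tostring() return str instead of bytes.
    # tostring() also appends each child's tail text by default.
    children = "".join(
        ET.tostring(child, encoding="unicode") for child in element
    )
    # element.text is None when the element starts with a child tag.
    return (element.text or "") + children


element = ET.fromstring("<myTag>Let us be <b>gratuitous</b> with tags</myTag>")
print(get_inner(element))  # Let us be <b>gratuitous</b> with tags
```

Note that the result is a re-serialization of the parsed tree, so details such as attribute quoting may differ from the original source text.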

What do you mean by "parse the custom XML tags without parsing any HTML content whatsoever"?
– Anand S Kumar, Commented Jul 19, 2015 at 4:05
An XML file is just like an HTML file, except you can define your own tags. I think the only way to achieve this is to add several if-else conditions for the tags you want to ignore or parse. I don't think there are any pre-built libraries in Python that ignore HTML tags.
– Vini.g.fer, Commented Jul 19, 2015 at 4:12
@AnandSKumar, I want to be able to access unparsed HTML that lives inside of XML tags that I want to be parsed. I updated my post to try to be a bit more clear about this.
– user2197148, Commented Jul 19, 2015 at 4:28
@user2197148 I am guessing you are mistaken. Even if your file were not parsed as HTML, <p> is still a valid XML tag, and it would be treated as an XML element.
– Anand S Kumar, Commented Jul 19, 2015 at 4:31

har07 · Accepted Answer · 2015-07-19 06:28:03Z

I don't think there is an easy way to modify an XML parser's behavior so that it ignores some predefined tags. A much easier way is to let the parser parse the XML normally, then write a function that returns the serialized ("unparsed") content of an element, for example:

import xml.etree.ElementTree as ET

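def getUnparsedContent(element):
    # Serialize each child (including its tail text) and join the pieces.
    # On Python 3, use ET.tostring(e, encoding='unicode') to get str, not bytes.
    return ''.join(ET.tostring(e) for e in element)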

xmlstring = """<myTag1 myAttrib="value">
  <myTag2>
    <p>My what a lovely day.</p>
  </myTag2>
</myTag1>"""

root = ET.fromstring(xmlstring)
print(getUnparsedContent(root[0]))

Output:

<p>My what a lovely day.</p>
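
Since the question mentions that the list of custom tags is small, the same approach can be extended with a whitelist. The following is only a sketch under that assumption: CUSTOM_TAGS, inner_xml, and html_payloads are hypothetical names, and the code targets Python 3 (hence encoding="unicode"). It parses the whole document as usual, then re-serializes the content of any custom element that contains a non-custom child.

```python
import xml.etree.ElementTree as ET

# Assumed whitelist of custom tags; everything else is treated as HTML.
CUSTOM_TAGS = {"myTag1", "myTag2"}


def inner_xml(element):
    """Leading text plus serialized children (tail text included)."""
    return (element.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in element
    )


def html_payloads(root, custom_tags):
    """Return (tag, html) pairs for custom elements that wrap HTML."""
    payloads = []
    for el in root.iter():
        # A custom element with at least one non-custom child holds HTML.
        if el.tag in custom_tags and any(
            child.tag not in custom_tags for child in el
        ):
            payloads.append((el.tag, inner_xml(el).strip()))
    return payloads


xmlstring = """<myTag1 myAttrib="value">
  <myTag2>
    <p>My what a lovely day.</p>
  </myTag2>
</myTag1>"""

root = ET.fromstring(xmlstring)
print(root.get("myAttrib"))              # value
print(html_payloads(root, CUSTOM_TAGS))  # [('myTag2', '<p>My what a lovely day.</p>')]
```

The strip() call only removes the indentation whitespace around the HTML; drop it if that whitespace matters.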

2 Comments

mzjn Over a year ago

This works, but it seems strange to talk about "unparsed content". fromstring() parses everything; there is no part of the XML that is "unparsed".

har07 Over a year ago

I can see your point, I was just trying to explain from the OP's point of view (personally I call it inner XML or inner HTML, influenced by property names for getting the same in .NET XML/HTML parser)

eleventhend · Answer · 2015-07-19 04:35:22Z

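You should be able to implement this through the built-in minidom XML parser.

from xml.dom import minidom

xmldoc = minidom.parse("document.xml")
rootNode = xmldoc.documentElement  # <myTag1>
tagNode = rootNode.getElementsByTagName("myTag2")[0]
# childNodes also contains whitespace text nodes, so keep only elements
firstNode = [n for n in tagNode.childNodes if n.nodeType == n.ELEMENT_NODE][0]

In your example case, firstNode.toxml() would return:

<p>My what a lovely day.</p>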

Note that minidom (and probably any other xml-parsing library you might use) won't recognize HTML by default. This is by design, because XML documents do not have predefined tags.

You could then use a series of if or try statements to determine whether you have reached an HTML-formatted node while extracting data:

for rowNode in rootNode.childNodes:
    # Text nodes have no tagName, so check the node type first
    if rowNode.nodeType == rowNode.ELEMENT_NODE and rowNode.tagName == "p":
        # this is an HTML-formatted node: extract the value and continue
        html = rowNode.toxml()
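
Putting the minidom pieces together, a self-contained sketch might look like the following. It assumes Python 3 and parses the question's sample from a string instead of document.xml; the variable names are illustrative only. Joining toxml() over all child nodes gives the inner markup, text nodes included.

```python
from xml.dom import minidom

xmlstring = """<myTag1 myAttrib="value">
  <myTag2>
    <p>My what a lovely day.</p>
  </myTag2>
</myTag1>"""

doc = minidom.parseString(xmlstring)

# Custom tags are read through the normal DOM API.
attrib = doc.documentElement.getAttribute("myAttrib")

# Re-serialize every child node of <myTag2>, text nodes included.
tag = doc.getElementsByTagName("myTag2")[0]
inner = "".join(node.toxml() for node in tag.childNodes).strip()

print(attrib)  # value
print(inner)   # <p>My what a lovely day.</p>
```

As with ElementTree, the HTML is still parsed as XML first and then serialized again, so it must be well-formed XML for this to work.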
