Parse HTML (or other XML) inside XML with Python 3

Question

I am trying to parse large files that have HTML embedded inside XML. I have been able to extract the whole content of the main XML, but I cannot access the content of the embedded HTML.

For example, I would have a file of this structure:

<TitleContentExtra>Part 1</TitleContentExtra><SubTitle /><TitleOriginal /><Abstract /><FullText>
&lt;p&gt;&lt;strong class="grey" id="authordate"&gt; &lt;span class="gray pointer"&gt;Argh, &lt;em&gt;et al.&lt;/em&gt; 2001 [+] &lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;div class="bkg_gray" id="authordate2_container" style="display: none;"&gt;
&lt;p&gt;It is a big product [some_product]:[bib2bib]&lt;/p&gt;
&lt;ul class="ul_style_1"&gt;
    &lt;li&gt;More text goes here &lt;/li&gt;
    &lt;li&gt;Why do I have to do it? &lt;strong class="gray"&gt;Some text goes there&lt;/strong&gt; &lt;/li&gt;
</FullText><FullTextOriginal /><FullTextComment>
&lt;ol class="ol_style_3" id="notes_container"&gt;
    &lt;li&gt;&lt;span id="note_a"&gt;&lt;a name="notea"&gt;&lt;/a&gt;Extra information here.&lt;/span&gt;&lt;/li&gt;
</FullTextComment>

My Python 3 code looks something like this:

try:
    from lxml import etree as ET

except ImportError:
    import xml.etree.ElementTree as ET

tree = ET.ElementTree(file='Files\\xml_File.xml')
root = tree.getroot()

for child in root:
    print (child.tag, child.attrib)

print ('\n------------------\n')
for elem in tree.iter():
    #print (elem.tag, 'atrribute: ',  elem.attrib)
    for value in elem.getiterator(tag=elem.tag):
        #print (value.text)
        extags=str(value.text)
        try:
            xmldata=ET.fromstring(extags)
            print (xmldata.tags)
        except:
            print ('There is an error: :', extags)

I am not able to parse the embedded HTML/XML text. I have tried many options (soupparser, parse, ...), but none of them works, or I have not been able to make them work.

I need to parse the whole XML file so that I can later get a list of all tags and attributes for further processing.

jsbueno · Accepted Answer · 2016-01-29 12:45:22Z

Well, that embedded HTML of yours is XML-escaped - it should be obvious that you have to unescape it before trying to parse it as XML.

Python 3 has a shortcut for unescaping in the html stdlib module:

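    import html
    ...
    # Unescape the embedded markup; elements without text give None.
    extags = html.unescape(value.text or '')
    try:
        xmldata = ET.fromstring(extags)
        print(xmldata.tag)  # an Element has .tag, not .tags
    except ET.ParseError:
        print('There is an error:', extags)
    ...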
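
Note that the fragments in the sample are not well-formed XML once unescaped (several top-level elements, unclosed `div` and `ul`), so `ET.fromstring` may still reject them. If all you need is the list of tags and attributes, a lenient parser could be used instead. What follows is only a minimal sketch under assumptions: the `Doc` wrapper element, the `TagCollector` class and the `embedded_tags` helper are hypothetical names made up for illustration, and the sample string is a shortened version of the one in the question. `html.unescape` is kept for the case where the text is still escaped; it leaves text unchanged when the XML parser has already decoded the entities.

```python
import html
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record (tag, attributes) for every start tag in a fragment."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        self.found.append((tag, dict(attrs)))


def embedded_tags(xml_text):
    """Map each outer element tag to the HTML tags found in its text."""
    result = {}
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if not elem.text or not elem.text.strip():
            continue  # no embedded markup in this element
        collector = TagCollector()
        # The HTML parser tolerates unclosed tags and multiple roots.
        collector.feed(html.unescape(elem.text))
        collector.close()
        if collector.found:
            result.setdefault(elem.tag, []).extend(collector.found)
    return result


# Hypothetical, shortened sample wrapped in an assumed <Doc> root.
sample = (
    '<Doc><Title>Part 1</Title><FullText>'
    '&lt;p&gt;&lt;strong class="grey" id="authordate"&gt;Argh'
    '&lt;/strong&gt;&lt;/p&gt;'
    '&lt;div class="bkg_gray"&gt;'
    '</FullText></Doc>'
)
print(embedded_tags(sample))
```

If you need a real tree for the embedded markup rather than a flat list, an HTML-aware parser such as the one in `lxml.html` would be the place to look.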
