I am trying to parse large files that have HTML embedded inside XML. I have been able to extract the whole content of the outer XML, but I cannot access the content of the embedded HTML.

For example, I have a file with this structure:

<TitleContentExtra>Part 1</TitleContentExtra><SubTitle /><TitleOriginal /><Abstract /><FullText>
&lt;p&gt;&lt;strong class="grey" id="authordate"&gt; &lt;span class="gray pointer"&gt;Argh, &lt;em&gt;et al.&lt;/em&gt; 2001 [+] &lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;div class="bkg_gray" id="authordate2_container" style="display: none;"&gt;
&lt;p&gt;It is a big product [some_product]:[bib2bib]&lt;/p&gt;
&lt;ul class="ul_style_1"&gt;
    &lt;li&gt;More text goes here &lt;/li&gt;
    &lt;li&gt;Why do I have to do it? &lt;strong class="gray"&gt;Some text goes there&lt;/strong&gt; &lt;/li&gt;
</FullText><FullTextOriginal /><FullTextComment>
&lt;ol class="ol_style_3" id="notes_container"&gt;
    &lt;li&gt;&lt;span id="note_a"&gt;&lt;a name="notea"&gt;&lt;/a&gt;Extra information here.&lt;/span&gt;&lt;/li&gt;
</FullTextComment>

My Python 3 code looks something like this:

try:
    from lxml import etree as ET

except ImportError:
    import xml.etree.ElementTree as ET

tree = ET.ElementTree(file='Files\\xml_File.xml')
root = tree.getroot()

for child in root:
    print (child.tag, child.attrib)

print ('\n------------------\n')
for elem in tree.iter():
    #print (elem.tag, 'atrribute: ',  elem.attrib)
    for value in elem.getiterator(tag=elem.tag):
        #print (value.text)
        extags=str(value.text)
        try:
            xmldata=ET.fromstring(extags)
            print (xmldata.tags)
        except:
            print ('There is an error: :', extags)

I am not able to parse the embedded HTML/XML text. I've tried several options (soupparser, parse, ...), but none of them works, or I have not been able to make them work.

I need to parse the whole XML file so that I can later get a list of all tags and attributes for further processing.

1 Answer

Well, that embedded HTML of yours is XML-escaped; it should be obvious that you have to unescape it before trying to parse it as XML.

Python 3 has a shortcut for unescaping in the html module of the standard library:

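    import html
    ...
    # value.text is None for empty elements such as <SubTitle />
    extags = html.unescape(value.text or '')
    try:
        xmldata = ET.fromstring(extags)
        print(xmldata.tag)  # an Element has .tag, not .tags
    except ET.ParseError:
        print('There is an error:', extags)
    ...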
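As a rough sketch of how the unescaping step could feed the "list of all tags and attributes" the question asks for, the example below walks the outer XML and hands each element's text to a tolerant HTML parser. Several things here are assumptions rather than part of the original post: the `Record` root element, the `TagCollector` and `collect_tags` names, and the shortened sample string are all made up for illustration. It also swaps `ET.fromstring` for the standard library's `html.parser.HTMLParser` on the embedded part, because the fragment in the question has several top-level elements and unclosed tags, which a strict XML parser would likely reject even after unescaping. Note too that an XML parser normally decodes `&lt;` and `&gt;` itself when it reads element text, so `html.unescape` may well be a no-op here unless the content was escaped twice.

```python
import html
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: record each start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (tag, {attribute: value}) pairs

    def handle_starttag(self, tag, attrs):
        self.found.append((tag, dict(attrs)))


def collect_tags(xml_text):
    """Return (tag, attributes) pairs for the outer XML and the embedded HTML."""
    root = ET.fromstring(xml_text)
    results = []
    for elem in root.iter():
        results.append((elem.tag, dict(elem.attrib)))
        # Empty elements such as <SubTitle /> have text == None.
        if elem.text and elem.text.strip():
            collector = TagCollector()
            # Harmless if the XML parser has already decoded the entities.
            collector.feed(html.unescape(elem.text))
            collector.close()
            results.extend(collector.found)
    return results


# Assumed sample: a made-up <Record> root around a shortened copy of the data.
sample = (
    "<Record><TitleContentExtra>Part 1</TitleContentExtra>"
    "<FullText>&lt;p&gt;&lt;strong class=\"grey\" id=\"authordate\"&gt;Argh"
    "&lt;/strong&gt;&lt;/p&gt;&lt;ul class=\"ul_style_1\"&gt;"
    "&lt;li&gt;More text goes here&lt;/li&gt;</FullText></Record>"
)

for tag, attrs in collect_tags(sample):
    print(tag, attrs)
```

For the sample above this lists the three outer elements followed by the embedded `p`, `strong`, `ul` and `li` tags, even though the `ul` is never closed. Only `elem.text` is inspected; if escaped HTML can also appear in `elem.tail`, it would need the same treatment.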