I am trying to parse large files that have HTML embedded inside XML. I have been able to extract the content of the main XML, but I cannot access the content of the embedded HTML.
For example, a file might have this structure:
<TitleContentExtra>Part 1</TitleContentExtra><SubTitle /><TitleOriginal /><Abstract /><FullText>
<p><strong class="grey" id="authordate"> <span class="gray pointer">Argh, <em>et al.</em> 2001 [+] </span></strong></p>
<div class="bkg_gray" id="authordate2_container" style="display: none;">
<p>It is a big product [some_product]:[bib2bib]</p>
<ul class="ul_style_1">
<li>More text goes here </li>
<li>Why do I have to do it? <strong class="gray">Some text goes there</strong> </li>
</FullText><FullTextOriginal /><FullTextComment>
<ol class="ol_style_3" id="notes_container">
<li><span id="note_a"><a name="notea"></a>Extra information here.</span></li>
</FullTextComment>
My Python 3 code looks something like this:
try:
    from lxml import etree as ET
except ImportError:
    import xml.etree.ElementTree as ET

tree = ET.ElementTree(file='Files\\xml_File.xml')
root = tree.getroot()

# List the top-level elements and their attributes.
for child in root:
    print(child.tag, child.attrib)
print('\n------------------\n')

# Try to parse the text of every element as an XML fragment.
for elem in tree.iter():
    # print(elem.tag, 'attribute:', elem.attrib)
    # iter() replaces getiterator(), which was removed in Python 3.9.
    for value in elem.iter(tag=elem.tag):
        # print(value.text)
        extags = str(value.text)
        try:
            xmldata = ET.fromstring(extags)
            print(xmldata.tag)
        except ET.ParseError:
            print('There is an error:', extags)
I am not able to parse the embedded HTML/XML text. I have tried several options, including soupparser and parse, but none of them works, or at least I have not been able to make them work.
I need to parse the whole XML file so that I can later build a list of all tags and attributes for further processing.
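
To make the goal concrete, here is a minimal sketch of the kind of result I am after: a mapping from each tag name to the attribute names seen on it. It assumes the standard-library `html.parser.HTMLParser`, which tolerates unclosed tags like those in the sample above. The names `TagCollector`, `collect_tags` and `sample` are hypothetical, made up for illustration, and I have not tried this on the real files.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect every tag name and the attribute names seen on it."""

    def __init__(self):
        super().__init__()
        # Maps tag name -> set of attribute names (hypothetical structure).
        self.seen = defaultdict(set)

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        # Self-closing tags such as <SubTitle /> also end up here.
        self.seen[tag].update(name for name, _ in attrs)


def collect_tags(markup):
    """Return {tag: sorted attribute names} for a markup string."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return {tag: sorted(names) for tag, names in collector.seen.items()}


# A shortened version of the sample above, with the same unclosed <ul>.
sample = (
    '<FullText><p><strong class="grey" id="authordate">Argh</strong></p>'
    '<ul class="ul_style_1"><li>More text goes here </li></FullText>'
)
result = collect_tags(sample)
print(result)
```

One caveat with this sketch is that the original casing of the XML tag names is lost (`FullText` is reported as `fulltext`), so it only illustrates the desired output and may not be the right tool for the outer XML.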