How to parse an XML file with xml.etree.ElementTree when a child element contains HTML content

Question

I have this XML file:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<Component>

<Custom/>
<ID>1</ID>
<LongDescription>
<html><html> <head> <style type="text/css"> <!-- .style9 { color: #ffff33; background-color: #ff00ff } .style8 { color: #990099; background-color: #66ffcc } .style7 { color: #0066cc; background-color: #ccffcc } .style6 { color: #009900; background-color: #ffffcc } .style11 { color: #000066; background-color: #ccffcc } .style5 { color: #cc0033; background-color: #99ff99 } .style10 { color: #99ff99; background-color: #00cccc } .style4 { color: #cc0033; background-color: #ccffff } .style3 { color: #0000dd; background-color: teal } .style2 { color: #0000cc; background-color: aqua } .style1 { color: blue; background-color: silver } .style0 { color: #000099; background-color: #ffffcc } --> </style> </head> <body> </body> </html> </html>
</LongDescription>
<Name>ip_bridge</Name>
</Component>

I read this file with the xml.etree.ElementTree library as follows:

import re
import xml.etree.ElementTree as ET

def getTokens(xml_string_file):
    tokensList = []
    tree = ET.parse(xml_string_file)
    root = tree.getroot()
    tokensList.append('<component>')
    for child in root:
        firstTag = '<' + child.tag + '>'
        lastTag = '</' + child.tag + '>'
        tokensList.append(firstTag)
        if child.text is None:
            tokensList.append('')
        elif re.findall(r"\n", child.text, re.DOTALL):  # text contains a newline: descend into the children
            tokensList = tokensList + extractTags(root=child)
        else:
            tokensList.append(child.text)
        tokensList.append(lastTag)
    tokensList.append('</component>')
    return tokensList

with the helper function extractTags:

def extractTags(root):
    tokensList = []
    for child in root:
        firstTag = '<' + child.tag + '>'
        lastTag = '</' + child.tag + '>'
        tokensList.append(firstTag)
        if child.text is None:
            tokensList.append('')
        elif re.findall(r"\n", child.text, re.DOTALL):  # To extract the children of the children
            tokensList = tokensList + extractTags(root=child)
        else:
            tokensList.append(child.text)
        tokensList.append(lastTag)
    return tokensList

The result is this token list:

['<component>', '<custom>', '', '</custom>', '<ID>', '1', '</ID>', '<LongDescription>', '<html>', '</html>', '</LongDescription>', '<Name>', 'ip_bridge', '</Name>', '</component>']

I also want to extract everything between the html tags as a single token (one text string).

Expected output: ['<component>', '<custom>', '', '</custom>', '<ID>', '1', '</ID>', '<LongDescription>', '<html>', '</html>', '<html><head><style>...</html>', '</LongDescription>', '<Name>', 'ip_bridge', '</Name>', '</component>']
– Emna Jaoua, Commented Jun 1, 2018 at 10:31
@Rakesh I also forgot to add the extractTags function. The post is updated now.
– Emna Jaoua, Commented Jun 1, 2018 at 10:49
This looks like a very complicated approach for creating a replica of the original tree. The output you create has nothing the actual XML tree doesn't have; I'm convinced it would be much simpler to skip creating this strange "token list" and work with the XML tree directly. What's the purpose or goal you want to achieve?
– Tomalak, Commented Jun 1, 2018 at 11:14
The purpose of the project is to regenerate an unseen XML file using machine learning methods. The token list is first encoded with a one-hot encoder. The encoded vector is then fed to an autoencoder model so that we can regenerate it, specifically with the decoder layer. So I need that token list. After the same token list has been regenerated, it will be written to an XML file.
– Emna Jaoua, Commented Jun 1, 2018 at 11:23

Tomalak · Accepted Answer · 2018-06-01 11:57:01Z

I would suggest a simple recursive generator that traverses the tree and yields tokens.

These can be put into a list very easily through a list comprehension.

from io import StringIO

xml = """<Component>
    <Custom/>
    <ID>1</ID>
    <LongDescription>
        <html>
            <html>
                <head>
                    <style type="text/css">
                        <!-- .style9 { color: #ffff33; } ... --> 
                    </style>
                </head>
                <body>
                </body>
            </html>
        </html>
    </LongDescription>
    <Name>ip_bridge</Name>
</Component>"""
xml_string_file = StringIO(xml)

# -----------------------------------------------------------------------
import xml.etree.ElementTree as ET

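def tokenize_tree(element):
    # Opening tag, then the element's own text ('' when there is none).
    yield '<%s>' % element.tag
    yield element.text if element.text else ''
    # Recurse into the children in document order.
    for child in element:
        yield from tokenize_tree(child)
    yield '</%s>' % element.tag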

tree = ET.parse(xml_string_file)    

token_list = [token for token in tokenize_tree(tree.getroot())]
print(token_list)

The output for me is:

['<Component>', '\n    ', '<Custom>', '', '</Custom>', '<ID>', '1', '</ID>', 
 '<LongDescription>', '\n        ', '<html>', '\n            ', '<html>', 
 '\n                ',  '<head>', '\n                    ',  '<style>', 
 '\n                         \n                    ', '</style>', '</head>', 
 '<body>', '\n                ', '</body>', '</html>', '</html>', 
 '</LongDescription>', '<Name>', 'ip_bridge', '</Name>', '</Component>']
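
The output above splits the HTML element by element. If, as in the question, the whole HTML block should end up as one token, a possible sketch is to serialize that subtree with ET.tostring instead of recursing into it. The function name tokenize_tree_opaque, the opaque_tags parameter, and the short sample document are assumptions made for this illustration; they are not part of the original code.

```python
import xml.etree.ElementTree as ET

def tokenize_tree_opaque(element, opaque_tags=("html",)):
    """Like tokenize_tree, but emit elements listed in opaque_tags as a single token.

    opaque_tags is a hypothetical parameter introduced for this sketch.
    """
    if element.tag in opaque_tags:
        # Serialize the whole subtree as one string. ET.tostring also appends
        # the element's tail text; strip() drops it when it is only whitespace.
        yield ET.tostring(element, encoding="unicode").strip()
        return
    yield '<%s>' % element.tag
    yield element.text if element.text else ''
    for child in element:
        yield from tokenize_tree_opaque(child, opaque_tags)
    yield '</%s>' % element.tag

# Shortened sample document (an assumption, not the file from the question).
sample = ("<Component><ID>1</ID><LongDescription>"
          "<html><body>Hi</body></html>"
          "</LongDescription></Component>")
tokens = list(tokenize_tree_opaque(ET.fromstring(sample)))
print(tokens)
# ['<Component>', '', '<ID>', '1', '</ID>', '<LongDescription>', '',
#  '<html><body>Hi</body></html>', '</LongDescription>', '</Component>']
```

Note that the default parser discards XML comments, so CSS wrapped in <!-- ... --> would not appear in the serialized token. On Python 3.8 and later, a parser built with ET.TreeBuilder(insert_comments=True) keeps them.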

You can handle whitespace-only text nodes and comments (such as the one in the <style> element) as you see fit. For example by doing:

if element.text and element.text.strip():
    yield element.text.strip()
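
Put together, the whitespace handling might look like the sketch below. The name tokenize_tree_stripped and the choice to keep an empty '' placeholder only for childless elements are assumptions for illustration; adjust them to whatever the downstream encoder expects.

```python
import xml.etree.ElementTree as ET

def tokenize_tree_stripped(element):
    """Variant of tokenize_tree that drops whitespace-only text nodes."""
    yield '<%s>' % element.tag
    if element.text and element.text.strip():
        yield element.text.strip()
    elif len(element) == 0:
        # Childless element without real text: keep an empty placeholder token.
        yield ''
    for child in element:
        yield from tokenize_tree_stripped(child)
    yield '</%s>' % element.tag

# Small sample with indentation and padded text (an assumption for this sketch).
root = ET.fromstring("<Component>\n  <Custom/>\n  <ID> 1 </ID>\n</Component>")
stripped_tokens = list(tokenize_tree_stripped(root))
print(stripped_tokens)
# ['<Component>', '<Custom>', '', '</Custom>', '<ID>', '1', '</ID>', '</Component>']
```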

For text node processing with ElementTree, look at "Python element tree - extract text from element, stripping tags". You might instead want to add something like:

for text in element.itertext():
    yield text

to the function above.
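
To see what itertext() produces, here is a small standalone sketch; the sample markup is an assumption. The method walks the whole subtree and yields every text and tail string in document order, without any tags.

```python
import xml.etree.ElementTree as ET

# Sample element with mixed text and a nested child (an assumption for this sketch).
elem = ET.fromstring("<p>Hello <b>bold</b> world</p>")

# itertext() yields the element's text, then each descendant's text and tail.
texts = list(elem.itertext())
print(texts)           # ['Hello ', 'bold', ' world']
print("".join(texts))  # Hello bold world
```

Because itertext() already covers all descendants, calling it for every element inside a recursive tokenizer would repeat the nested text. It probably fits best on elements that are treated as leaves and not recursed into.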

For HTML in general, which will have text nodes and element nodes intermixed, see "Python ElementTree - iterate through child nodes and text in order".
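
As a rough sketch of that idea, the generator can also yield each child's tail, which is where ElementTree stores the text that follows a child's closing tag. The name tokenize_mixed and the sample markup are assumptions for illustration, not code from the linked question.

```python
import xml.etree.ElementTree as ET

def tokenize_mixed(element):
    """Yield tags and text in document order, including text after child elements."""
    yield '<%s>' % element.tag
    if element.text:
        yield element.text
    for child in element:
        yield from tokenize_mixed(child)
        # Text between this child's closing tag and the next sibling (or parent end).
        if child.tail:
            yield child.tail
    yield '</%s>' % element.tag

mixed_tokens = list(tokenize_mixed(ET.fromstring("<p>Hello <b>bold</b> world</p>")))
print(mixed_tokens)
# ['<p>', 'Hello ', '<b>', 'bold', '</b>', ' world', '</p>']
```

Unlike tokenize_tree above, this variant emits no '' placeholder for empty elements; whether that placeholder is needed depends on how the tokens are encoded afterwards.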
