I have this XML file:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<Component>

<Custom/>
<ID>1</ID>
<LongDescription>
<html><html> <head> <style type="text/css"> <!-- .style9 { color: #ffff33; background-color: #ff00ff } .style8 { color: #990099; background-color: #66ffcc } .style7 { color: #0066cc; background-color: #ccffcc } .style6 { color: #009900; background-color: #ffffcc } .style11 { color: #000066; background-color: #ccffcc } .style5 { color: #cc0033; background-color: #99ff99 } .style10 { color: #99ff99; background-color: #00cccc } .style4 { color: #cc0033; background-color: #ccffff } .style3 { color: #0000dd; background-color: teal } .style2 { color: #0000cc; background-color: aqua } .style1 { color: blue; background-color: silver } .style0 { color: #000099; background-color: #ffffcc } --> </style> </head> <body> </body> </html> </html>
</LongDescription>
<Name>ip_bridge</Name>
</Component>

I am reading this file with the xml.etree.ElementTree library as follows:

import re
import xml.etree.ElementTree as ET

def getTokens(xml_string_file):
    tokensList = []
    tree = ET.parse(xml_string_file)
    root = tree.getroot()
    tokensList.append('<component>')
    for child in root:
        firstTag = '<' + child.tag + '>'
        lastTag = '</' + child.tag + '>'
        tokensList.append(firstTag)
        if child.text is None:
            tokensList.append('')
        elif re.findall(r"\n", child.text, re.DOTALL):  # text contains a newline: descend into the children
            tokensList = tokensList + extractTags(root=child)
        else:
            tokensList.append(child.text)
        tokensList.append(lastTag)
    tokensList.append('</component>')
    return tokensList

with the helper function extractTags:

def extractTags(root):
    tokensList = []
    for child in root:
        firstTag = '<' + child.tag + '>'
        lastTag = '</' + child.tag + '>'
        tokensList.append(firstTag)
        if child.text is None:
            tokensList.append('')
        elif re.findall(r"\n", child.text, re.DOTALL):  # To extract the children of the children
            tokensList = tokensList + extractTags(root=child)
        else:
            tokensList.append(child.text)
        tokensList.append(lastTag)
    return tokensList

As a result I get this token list:

['<component>', '<custom>', '', '</custom>', '<ID>', '1', '</ID>', '<LongDescription>', '<html>', '</html>', '</LongDescription>', '<Name>', 'ip_bridge', '</Name>', '</component>']

I also want to extract everything between the html tags as a single token (one text).

  • Can you post your expected output? Commented Jun 1, 2018 at 10:28
  • expected output ['<component>', '<custom>', '', '</custom>', '<ID>', '1', '</ID>', '<LongDescription>', '<html>', '</html>','<html><head><style>...</html>' ,'</LongDescription>', '<Name>', 'ip_bridge', '</Name>', '</component>'] Commented Jun 1, 2018 at 10:31
  • @Rakesh I forgot to add the extractTags function also. It's updated now in the post. Commented Jun 1, 2018 at 10:49
  • This looks like a very complicated approach for creating a replica of the original tree. The output you create has nothing the actual XML tree doesn't have; I'm convinced it would be much simpler to skip creating this strange "token list" and work with the XML tree directly. What's the purpose or goal you want to achieve? Commented Jun 1, 2018 at 11:14
  • The purpose of the project is to regenerate an unseen XML file using machine learning methods. The token list is first encoded with a one-hot encoder. The encoded vector is then fed to an autoencoder model so that we can regenerate it, specifically with the decoder layer. So I need that token list. After regenerating the same token list, it will be written to an XML file. Commented Jun 1, 2018 at 11:23

1 Answer

I would suggest a simple recursive generator that traverses the tree and yields tokens.

These can be put into a list very easily through a list comprehension.

from io import StringIO

xml = """<Component>
    <Custom/>
    <ID>1</ID>
    <LongDescription>
        <html>
            <html>
                <head>
                    <style type="text/css">
                        <!-- .style9 { color: #ffff33; } ... --> 
                    </style>
                </head>
                <body>
                </body>
            </html>
        </html>
    </LongDescription>
    <Name>ip_bridge</Name>
</Component>"""
xml_string_file = StringIO(xml)

# -----------------------------------------------------------------------
import xml.etree.ElementTree as ET

def tokenize_tree(element):
    yield '<%s>' % element.tag 
    yield element.text if element.text else ''
    for child in element:
        yield from tokenize_tree(child)
    yield '</%s>' % element.tag 

tree = ET.parse(xml_string_file)    

token_list = [token for token in tokenize_tree(tree.getroot())]
print(token_list)

The output for me is:

['<Component>', '\n    ', '<Custom>', '', '</Custom>', '<ID>', '1', '</ID>', 
 '<LongDescription>', '\n        ', '<html>', '\n            ', '<html>', 
 '\n                ',  '<head>', '\n                    ',  '<style>', 
 '\n                         \n                    ', '</style>', '</head>', 
 '<body>', '\n                ', '</body>', '</html>', '</html>', 
 '</LongDescription>', '<Name>', 'ip_bridge', '</Name>', '</Component>']

You can handle whitespace-only text nodes and comments (such as the one in the <style> element) as you see fit. For example by doing:

if element.text and element.text.strip():
    yield element.text.strip()
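
Put together, a whitespace-aware variant of the generator might look like the sketch below. The name tokenize_tree_stripped and the shortened sample document are only illustrative, and the rule of emitting '' for empty leaf elements is an assumption taken from the expected output in the question's comments, not something ElementTree requires.

```python
import xml.etree.ElementTree as ET

def tokenize_tree_stripped(element):
    """Yield tag and text tokens, ignoring whitespace-only text nodes."""
    yield '<%s>' % element.tag
    text = (element.text or '').strip()
    if text:
        yield text
    elif len(element) == 0:
        # Empty leaf such as <Custom/>: keep the '' placeholder token
        # (assumption based on the expected output in the question).
        yield ''
    for child in element:
        yield from tokenize_tree_stripped(child)
    yield '</%s>' % element.tag

# Shortened, hypothetical sample document.
sample = "<Component>\n  <Custom/>\n  <ID>1</ID>\n  <Name>ip_bridge</Name>\n</Component>"
tokens = list(tokenize_tree_stripped(ET.fromstring(sample)))
print(tokens)
```

The indentation between elements no longer shows up as tokens, while empty elements still produce a '' entry.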

For text node processing with ElementTree, look at Python element tree - extract text from element, stripping tags - you might instead want to add something like:

for text in element.itertext():
    yield text

to the function above.
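
One caveat: itertext() already walks the whole subtree, so calling it at every level of the recursion would repeat the same text once per ancestor. If the goal is a single token for everything inside <html>, it may be simpler to call it once on that element. The sketch below shows two possible readings of "one token", using a made-up sample: the text with tags stripped (itertext) and the serialized markup (ET.tostring). Which of the two the question needs is an assumption on my part.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample; the real <style> content is much longer.
sample = ("<LongDescription><html><head><style>a { color: red }</style></head>"
          "<body>Hello</body></html></LongDescription>")
root = ET.fromstring(sample)
html = root.find('html')

# All text below <html>, tags stripped, joined into one token.
text_token = ''.join(html.itertext())

# The whole <html> subtree, tags included, as one string token.
markup_token = ET.tostring(html, encoding='unicode')

print(text_token)
print(markup_token)
```

Note that ElementTree drops XML comments by default, so CSS wrapped in <!-- ... --> inside <style> would not appear in either token unless the parser is configured to keep comments.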

For HTML in general, which will have text nodes and element nodes intermixed, see Python ElementTree - iterate through child nodes and text in order
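
For that intermixed case, the text that follows a child's closing tag is stored in the child's tail attribute, not in the parent's text. A minimal sketch of a generator that keeps text and elements in document order could look like this (tokenize_mixed and the sample markup are illustrative names, not part of the code above):

```python
import xml.etree.ElementTree as ET

def tokenize_mixed(element):
    """Yield tags, text and tail text in document order."""
    yield '<%s>' % element.tag
    if element.text and element.text.strip():
        yield element.text.strip()
    for child in element:
        yield from tokenize_mixed(child)
        # Text after the child's closing tag lives in child.tail.
        if child.tail and child.tail.strip():
            yield child.tail.strip()
    yield '</%s>' % element.tag

sample = "<p>Hello <b>bold</b> world</p>"
mixed_tokens = list(tokenize_mixed(ET.fromstring(sample)))
print(mixed_tokens)
```

Without the tail handling, the word "world" in this sample would be lost.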
