Parse XML data with multiple roots in Python

Question

I'm making an API call that returns multiple XML responses, like so:

<?xml version="1.0" encoding="UTF-8"?>
<BESAPI xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="BESAPI.xsd">
        <Action Resource="https://www.example.com">
                <Name> ABC </Name>
                <ID> 123 </ID>
        </Action>
</BESAPI>

<?xml version="1.0" encoding="UTF-8"?>
<BESAPI xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="BESAPI.xsd">
        <Action Resource="https://www.example.com">
                <Name> DEF </Name>
                <ID> 456 </ID>
        </Action>
</BESAPI>

<?xml version="1.0" encoding="UTF-8"?>
<BESAPI xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="BESAPI.xsd">
        <Action Resource="https://www.example.com">
                <Name> GHI </Name>
                <ID> 789 </ID>
        </Action>
</BESAPI>

I want to parse all the action IDs from the ID tags and add them to a list:

import requests
import xml.etree.ElementTree as ET
url = ""
payload = ""
headers = {}
response = requests.post(url, headers=headers, data=payload)

root = ET.fromstring(response.content)
actionidlist = []
for elem in root.iter('Action'):
    for subelem in elem.iter('ID'):
        actionidlist.append(subelem.text)
        print(actionidlist)

I get errors though because there are multiple roots. How do I parse this?

Edit: By errors I mean, actionidlist seems to only contain the last ID and not the rest of the IDs.

Can you show the import and parse in your code? We don't know if you're using the standard library xml module or lxml, for example. Also, you say "I get errors" but you don't show them. Is it in the parsing phase, or when calling root.iter()? Please include the full stack trace.
– joao, Commented Feb 12, 2021 at 7:32
Wrap the response in a single root element in order to make it well-formed XML.
– mzjn, Commented Feb 12, 2021 at 7:55
I would carefully read the API instructions. Are you sending multiple params? It's hard to believe an API would return an XML response that is not well-formed. Is it embedded in larger XML? Get in touch with the maintainers.
– Parfait, Commented Feb 13, 2021 at 0:25
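
A minimal sketch of mzjn's suggestion, assuming the response is text in which each document starts with its own `<?xml ...?>` declaration: drop the declarations, wrap what remains in one synthetic root element, and parse once. The `wrap_in_root` helper, the `root` wrapper tag and the shortened `SAMPLE` data are hypothetical, chosen only for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Stand-in for response.text (assumption: same shape as the data in the question).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<BESAPI>
    <Action Resource="https://www.example.com">
        <Name> ABC </Name>
        <ID> 123 </ID>
    </Action>
</BESAPI>

<?xml version="1.0" encoding="UTF-8"?>
<BESAPI>
    <Action Resource="https://www.example.com">
        <Name> DEF </Name>
        <ID> 456 </ID>
    </Action>
</BESAPI>
"""


def wrap_in_root(text):
    """Remove every XML declaration and wrap the rest in one <root> element."""
    body = re.sub(r"<\?xml\s[^>]*\?>", "", text)
    return "<root>" + body + "</root>"


root = ET.fromstring(wrap_in_root(SAMPLE))
# Strip the whitespace that surrounds each ID in the sample data.
action_ids = [id_elem.text.strip() for id_elem in root.iter("ID")]
print(action_ids)
```

Because the wrapper produces a single well-formed document, one `ET.fromstring()` call is enough and `root.iter("ID")` visits the IDs of every original response.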

joao · Accepted Answer · 2021-02-12 23:29:06Z


ET.fromstring() parses only one XML document. If you try to parse your entire input data, with multiple roots, you get the error:

xml.etree.ElementTree.ParseError: junk after document element: line 9, column 0

So I suggest pre-processing the input data to split it into a list of XML responses, then parsing each one in turn:

import requests
import xml.etree.ElementTree as ET

url = ""
payload = ""
headers = {}
response = requests.post(url, headers=headers, data=payload)

# Split the input data into a list of XML sections.
# response.content is bytes, so the sections are built as bytes too.
xml_sections = [b'']
for line in response.content.splitlines():
    if line.strip():
        xml_sections[-1] += line + b'\n'
    else:
        # A blank line marks the end of the current section
        xml_sections.append(b'')

# Parse each XML section separately
actionidlist = []
for s in xml_sections:
    if not s:
        continue  # skip empty sections left by consecutive or trailing blank lines
    root = ET.fromstring(s)
    for elem in root.iter('Action'):
        for subelem in elem.iter('ID'):
            actionidlist.append(subelem.text)
print(actionidlist)

This produces the following output:

[' 123 ', ' 456 ', ' 789 ']
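
The blank-line split above depends on the documents being separated by empty lines. A possible alternative, sketched here under the assumption that every document begins with an `<?xml` declaration, is to split on that declaration instead; it also strips the whitespace around each ID. `split_documents` is a hypothetical helper name, and `SAMPLE` is shortened stand-in data for `response.text`.

```python
import re
import xml.etree.ElementTree as ET

# Stand-in for response.text; note there are no blank lines between documents.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<BESAPI><Action><Name> ABC </Name><ID> 123 </ID></Action></BESAPI>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<BESAPI><Action><Name> DEF </Name><ID> 456 </ID></Action></BESAPI>\n'
)


def split_documents(text):
    """Split concatenated XML documents at each <?xml declaration.

    Splitting on a zero-width lookahead requires Python 3.7 or later.
    """
    parts = re.split(r"(?=<\?xml\s)", text)
    return [part.strip() for part in parts if part.strip()]


action_ids = []
for doc in split_documents(SAMPLE):
    root = ET.fromstring(doc)
    for id_elem in root.iter("ID"):
        action_ids.append(id_elem.text.strip())
print(action_ids)
```

Each piece keeps its own declaration, so every call to `ET.fromstring()` still receives a complete, well-formed document.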



jenhenry Over a year ago

Perfect! Splitting the xml responses worked! Thank you!

Harshal Taware · Answer · 2021-02-12 07:30:21Z

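from pyspark.sql import SQLContext

# Assumes a Spark environment with the spark-xml package available
# (for example a Databricks notebook, where `sc` and `display` are predefined).
sqlContext = SQLContext(sc)
file = "filepath/<xml_file_name.xml>"
schema_path = "filepath/<xml_schema_name.xml>"

# The file at schema_path is a sample document with the same structure as the
# data. It is loaded only so that Spark can infer the schema from it:
#
# <?xml version="1.0" encoding="UTF-8"?>
# <BESAPI xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
# xsi:noNamespaceSchemaLocation="BESAPI.xsd">
#         <Action Resource="https://www.example.com">
#                 <Name> string </Name>
#                 <ID> INT </ID>
#         </Action>
# </BESAPI>
#
# <?xml version="1.0" encoding="UTF-8"?>
# <BESAPI xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
# xsi:noNamespaceSchemaLocation="BESAPI.xsd">
#         <Action Resource="https://www.example.com">
#                 <Name> string </Name>
#                 <ID> INT </ID>
#         </Action>
# </BESAPI>

# rowTag has to name an element. Resource is an attribute of Action, so BESAPI
# is assumed here: each BESAPI element should then become one row with an
# Action struct column, which is what the query below expects.
df_schema = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='BESAPI').load(schema_path)
df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='BESAPI').load(file, schema=df_schema.schema)
# display(df)
df.createOrReplaceTempView("temptable")
structured_df = sqlContext.sql("select concat_ws(', ', Action.Name) as Name, concat_ws(', ', Action.ID) as ID from temptable")
# display() exists only in Databricks notebooks; use structured_df.show() elsewhere.
display(structured_df)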
