How can we convert a nested XML to CSV in Python Dynamically, Nested XML may contain array of values as well?

Question

Sharing Sample XML file. Need to convert this fie to CSV, even if extra tags are added in this file. {without using tag names}. And XML file tag names should be used as column names while converting it to CSV}

Example Data:

<?xml version="1.0" encoding="UTF-8"?>

<Food>
    <Info>
        <Msg>Food Store items.</Msg>
    </Info>

    <store slNo="1">
        <foodItem>meat</foodItem>
        <price>200</price>
        <quantity>1kg</quantity>
        <discount>7%</discount>
    </store>

    <store slNo="2">
        <foodItem>fish</foodItem>
        <price>150</price>
        <quantity>1kg</quantity>
        <discount>5%</discount>
    </store>

    <store slNo="3">
        <foodItem>egg</foodItem>
        <price>100</price>
        <quantity>50 pieces</quantity>
        <discount>5%</discount>
    </store>

    <store slNo="4">
        <foodItem>milk</foodItem>
        <price>50</price>
        <quantity>1 litre</quantity>
        <discount>3%</discount>
    </store>

</Food>

Tried Below code but getting error with same.

import xml.etree.ElementTree as ET
import pandas as pd

ifilepath = r'C:\DATA_DIR\feeds\test\sample.xml'
ofilepath = r'C:\DATA_DIR\feeds\test\sample.csv'
root = ET.parse(ifilepath).getroot()

print(root)
with open(ofilepath, "w") as file:
    for child in root:
        print(child.tag, child.attrib)
        # naive example how you could save to csv line wise
        file.write(child.tag+";"+child.attrib)

Above code is able to find root node, but unable to concatenate its attributes though

Tried one more code, but this works for 1 level nested XML, who about getting 3-4 nested tags in same XML file. And currently able to print values of all tags and their text. need to convert these into relational model { CSV file}

import xml.etree.ElementTree as ET

tree = ET.parse(ifilepath)
root = tree.getroot()
for member in root.findall('*'):
    print(member.tag,member.attrib)
    for i in (member.findall('*')):
        print(i.tag,i.text)

Above example works well with pandas read_xml { using lxml parser}

But when we try to use the similar way out for below XML data, it doesn't produce indicator ID value and Country ID value as output in CSV file

Example Data ::

<?xml version="1.0" encoding="UTF-8"?>
<du:data xmlns:du="http://www.dummytest.org" page="1" pages="200" per_page="20" total="1400" sourceid="5" sourcename="Dummy ID Test" lastupdated="2022-01-01">
   <du:data>
      <du:indicator id="AA.BB">various, tests</du:indicator>
      <du:country id="MM">test again</du:country>
      <du:date>2021</du:date>
      <du:value>1234567</du:value>
      <du:unit />
      <du:obs_status />
      <du:decimal>0</du:decimal>
   </du:data>
   <du:data>
      <du:indicator id="XX.YY">testing, cases</du:indicator>
      <du:country id="DD">coverage test</du:country>
      <du:date>2020</du:date>
      <du:value>3456223</du:value>
      <du:unit />
      <du:obs_status />
      <du:decimal>0</du:decimal>
   </du:data>
</du:data>

Solution Tried ::

import pandas as pd
    
pd.read_xml(ifilepath, xpath='.//du:data', namespaces= {"du": "http://www.dummytest.org"}).to_csv(ofilepath, sep=',', index=None, header=True)

Output Got ::

indicator,country,date,value,unit,obs_status,decimal
"various, tests",test again,2021,1234567,,,0
"testing, cases",coverage test,2020,3456223,,,0

Expected output ::

indicator id,indicator,country id,country,date,value,unit,obs_status,decimal
AA.BB,"various, tests",MM,test again,2021,1234567,,,0
XX.YY,"testing, cases",DD,coverage test,2020,3456223,,,0

Adding Example data , having usage of 2 or more xpath's. Looking for ways to convert the same using pandas to_csv()

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl'?>
<CATALOG>
    <PLANT>
    <COMMON>rose</COMMON>
    <BOTANICAL>canadensis</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Shady</LIGHT>
    <PRICE>202</PRICE>
    <AVAILABILITY>446</AVAILABILITY>
    </PLANT>
    <PLANT>
    <COMMON>mango</COMMON>
    <BOTANICAL>sunny</BOTANICAL>
    <ZONE>3</ZONE>
    <LIGHT>shady</LIGHT>
    <PRICE>301</PRICE>
    <AVAILABILITY>569</AVAILABILITY>
    </PLANT>
    <PLANT>
    <COMMON>Marigold</COMMON>
    <BOTANICAL>palustris</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Sunny</LIGHT>
    <PRICE>500</PRICE>
    <AVAILABILITY>799</AVAILABILITY>
    </PLANT>
    <PLANT>
    <COMMON>carrot</COMMON>
    <BOTANICAL>Caltha</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>sunny</LIGHT>
    <PRICE>205</PRICE>
    <AVAILABILITY>679</AVAILABILITY>
    </PLANT>
    <FOOD>
    <NAME>daal fry</NAME>
    <PRICE>300</PRICE>
    <DESCRIPTION>
    Famous daal tadka from surat
    </DESCRIPTION>
    <CALORIES>60</CALORIES>
    </FOOD>
    <FOOD>
    <NAME>Dhosa</NAME>
    <PRICE>350</PRICE>
    <DESCRIPTION>
    The famous south indian dish
    </DESCRIPTION>
    <CALORIES>80</CALORIES>
    </FOOD>
    <FOOD>
    <NAME>Khichdi</NAME>
    <PRICE>150</PRICE>
    <DESCRIPTION>
    The famous gujrati dish
    </DESCRIPTION>
    <CALORIES>40</CALORIES>
    </FOOD>
    <BOOK>
      <AUTHOR>Santosh Bihari</AUTHOR>
      <TITLE>PHP Core</TITLE>
      <GENER>programming</GENER>
      <PRICE>44.95</PRICE>
      <DATE>2000-10-01</DATE>
   </BOOK>
   <BOOK>
      <AUTHOR>Shyam N Chawla</AUTHOR>
      <TITLE>.NET Begin</TITLE>
      <GENER>Computer</GENER>
      <PRICE>250</PRICE>
      <DATE>2002-17-05</DATE>
   </BOOK>
   <BOOK>
      <AUTHOR>Anci C</AUTHOR>
      <TITLE>Dr. Ruby</TITLE>
      <GENER>Computer</GENER>
      <PRICE>350</PRICE>
      <DATE>2001-04-11</DATE>
   </BOOK>
</CATALOG>

StackOverflow is not a free code-writing service. Please research for solutions to this regular problem and make an earnest attempt at solution. Come back with a specific issue regarding your implementation. — Parfait
– Parfait, Commented Oct 25, 2022 at 13:50
We understand what StackOverflow is . Have tried many a ways , but looking for a generic way to convert nested XML to CSV format. — Eja
– Eja, Commented Oct 27, 2022 at 12:31
Error :: file.write(child.tag+";"+child.attrib) TypeError: can only concatenate str (not "dict") to str <Element 'Food' at 0x000002603F6139A8> Info {} — Eja
– Eja, Commented Oct 27, 2022 at 12:32
Please edit your post with attempted code and not in long, hard-to-read comments. Once done, please delete your comments. — Parfait
– Parfait, Commented Oct 27, 2022 at 20:28

Jack Fleeting · Accepted Answer · 2022-11-01 11:22:43Z

1

ElementTree is not really the best tool for what I believe you're trying to do. Since you have well-formed, relatively simple xml, try using pandas:

import pandas as pd

#from here, it's just a one liner
pd.read_xml('input.xml',xpath='.//store').to_csv('output.csv',sep=',', index = None, header=True)

and that should get you your csv file.

answered Nov 1, 2022 at 11:22

Jack Fleeting

25k6 gold badges27 silver badges49 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Eja Over a year ago

thanks, this works well on simple XML files. But is there a way we can convert XML files to CSV generically, without knowing its xpath. And every subtag can be appended to its previous tag name as new column.

Eja Over a year ago

Tried this way on shared data in edited Description , output is not providing all the column values { specially not for "indicator id="AA.BB"" and country id="MM"

Jack Fleeting Over a year ago

@Eja Of course it doesn't work. You newly edited sample xml is nothing like the pre-edit sample xml...

Parfait · Accepted Answer · 2022-11-03 13:19:21Z

1

Given parsing element values and their corresponding attributes involves a second layer of iteration, consider a nested list/dict comphrehension with dictionary merge. Also, use csv.DictWriter to build CSV via dictionaries:

from csv import DictWriter
import xml.etree.ElementTree as ET

ifilepath = "Input.xml"

tree = ET.parse(ifilepath)
nmsp = {"du": "http://www.dummytest.org"}

data = [
     {
       **{el.tag.split('}')[-1]: (el.text.strip() if el.text is not None else None) for el in d.findall("*")},
       **{f"{el.tag.split('}')[-1]} {k}":v for el in d.findall("*") for k,v in el.attrib.items()},
       **d.attrib
     }     
     for d in tree.findall(".//du:data", namespaces=nmsp)    
]

dkeys = list(data[0].keys())

with open("DummyXMLtoCSV.csv", "w", newline="") as f:
    dw = DictWriter(f, fieldnames=dkeys)
    dw.writeheader()
    
    dw.writerows(data)

Output

indicator,country,date,value,unit,obs_status,decimal,indicator id,country id
"various, tests",test again,2021,1234567,,,0,AA.BB,MM
"testing, cases",coverage test,2020,3456223,,,0,XX.YY,DD

While above will add attributes to last columns of CSV. For specific ordering, re-order the dictionaries:

data = [ ... ]

cols = ["indicator id", "indicator", "country id", "country", "date", "value", "unit", "obs_status", "decimal"]

data = [
    {k: d[k] for k in cols} for d in data
]

with open("DummyXMLtoCSV.csv", "w", newline="") as f:
    dw = DictWriter(f, fieldnames=cols)
    dw.writeheader()
    
    dw.writerows(data)

Output

indicator id,indicator,country id,country,date,value,unit,obs_status,decimal
AA.BB,"various, tests",MM,test again,2021,1234567,,,0
XX.YY,"testing, cases",DD,coverage test,2020,3456223,,,0

edited Nov 3, 2022 at 13:19

answered Nov 1, 2022 at 21:00

Parfait

108k19 gold badges103 silver badges138 bronze badges

10 Comments

Eja Over a year ago

This works well Parfait. But if i try to run the same code for a simple XML file { the first example with <?xml version="1.0" encoding="UTF-8"?> }. This gives me error as "*{el.tag.split('}')[1]: (el.text.strip() if el.text is not None else None) for el in d.findall("")}, IndexError: list index out of range"

Eja Over a year ago

Any generic way to handle both the XML files conversion to CSV. ?

Parfait Over a year ago

See edit, adjusting index to -1 and adding the top-level attributes.

Eja Over a year ago

Yes, using -1 as index, and using top-level attributes works well on simple XML file also. But can we look for some generic code, that can cover both the examples together. ?

Eja Over a year ago

Also, instead of writing this data in to a file, can we store the same in a dataframe. That will be an ease to then convert that dataframe in to xls,csv or any other form

|

Collectives™ on Stack Overflow

How can we convert a nested XML to CSV in Python Dynamically, Nested XML may contain array of values as well?

2 Answers 2

3 Comments

10 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

10 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related