0

I have XML file which looks like this: enter image description here

I'm trying to conver it to DF in Python using this:

import pandas as pd
import xml.etree.ElementTree as et

def parse_XML(xml_file, df_cols): 
"""Parse the input XML file and store the result in a pandas 
DataFrame with the given columns. 

The first element of df_cols is supposed to be the identifier 
variable, which is an attribute of each node element in the 
XML data; other features will be parsed from the text content 
of each sub-element. 
"""

xtree = et.parse(xml_file)
xroot = xtree.getroot()
rows = []

for node in xroot: 
    res = []
    res.append(node.attrib.get(df_cols[0]))
    for el in df_cols[1:]: 
        if node is not None and node.find(el) is not None:
            res.append(node.find(el).text)
        else: 
            res.append(None)
    rows.append({df_cols[i]: res[i] 
                 for i, _ in enumerate(df_cols)})

out_df = pd.DataFrame(rows, columns=df_cols)
    
return out_df

I'm using function like this:

parse_XML(R'C:\Users\aleks\Desktop\zadania python+sql\import_xml.xml', ['Year', 'Row', 'New', 'Used', 'Total Sales New', 'Total Sales Used'])

But the result instead of full imported XML file is this: enter image description here

How do I make it work so I have full XML file imported to DataFrame? Thanks.

2

1 Answer 1

1

Since your XML is relatively flat, consider merging dictionaries of element and attribute data to be passed into pandas.DataFrame constructor. Specifically, dictionaries are derived via list/dict comprehension of tag and text parsing of row elements and attrib dictionary.

xtree = et.parse(xml_file)

# LIST OF DICTIONARIES
data = [{ **{r.tag:r.text for r in row.findall('*')}, **row.attrib
        } for row in xtree.findall('.//row/row')]

out_df = pd.DataFrame(data)

To demonstrate from OpenData Michigan MVA Vehicle Sales Counts (which resembles OP's image data):

import pandas as pd
import xml.etree.ElementTree as et
from urllib import request

url = "https://opendata.maryland.gov/api/views/un65-7ipd/rows.xml?accessType=DOWNLOAD"
rq = request.urlopen(url)
xtree = et.fromstring(rq.read())

# LIST OF DICTIONARIES
data = [{ **{r.tag:r.text for r in row.findall('*')}, **row.attrib
        } for row in xtree.findall('.//row/row')]

out_df = pd.DataFrame(data)

Output

out_df

#     year month    new   used total_sales_new total_sales_used                 _id                                 _uuid _position                                           _address
#0    2002   JAN  31106  49927       755015820        386481929  row-a49t-ihm3.rpap  00000000-0000-0000-05D6-978893D301BF         0  https://opendata.maryland.gov/resource/un65-7i...
#1    2002   FEB  27520  50982       664454223        361353242  row-njtn_crc8_qawh  00000000-0000-0000-3211-E4A718B8A756         0  https://opendata.maryland.gov/resource/un65-7i...
#2    2002   MAR  34225  58794       805666244        419385387  row-tn8d-5xax-gcqz  00000000-0000-0000-9FC0-AB50DEE551EE         0  https://opendata.maryland.gov/resource/un65-7i...
#3    2002   APR  36452  59817       846368297        433061150  row-bjnm-jnr7.7rbw  00000000-0000-0000-FE23-E2E416AC13DB         0  https://opendata.maryland.gov/resource/un65-7i...
#4    2002   MAY  37359  60577       855005784        442569410  row-5wkh~fkp3.mzuj  00000000-0000-0000-CBC6-6E4FF3707E07         0  https://opendata.maryland.gov/resource/un65-7i...
#..    ...   ...    ...    ...             ...              ...                 ...                                   ...       ...                                                ...
#221  2020   JUN  23704  63395       821297823        699671992  row-m9ah.mpqr.7eww  00000000-0000-0000-3599-5C2EEABDA225         0  https://opendata.maryland.gov/resource/un65-7i...
#222  2020   JUL  22748  65652       784818757        608761285  row-79ht-rw2j~j4vy  00000000-0000-0000-4E88-6D5061A28AC5         0  https://opendata.maryland.gov/resource/un65-7i...
#223  2020   AUG  12640  48594       456451505        329492612  row-yh7w-48wu_vpnx  00000000-0000-0000-6ED5-CA4A564695EA         0  https://opendata.maryland.gov/resource/un65-7i...
#224  2020   SEP  11336  42148       395164402        274497385  row-e2ur-7zck~4rdd  00000000-0000-0000-C640-6EE6DAA0F07C         0  https://opendata.maryland.gov/resource/un65-7i...
#225  2020   OCT  13271  43827       477111877        328752668  row-yn9u~un9g_3heh  00000000-0000-0000-4EEE-250E14B81104         0  https://opendata.maryland.gov/resource/un65-7i...
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.