How to convert an XML file to a nice pandas DataFrame?

Question

Let's assume that I have an XML like this:

<author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
    <documents count="N">
        <document KEY="e95a9a6c790ecb95e46cf15bee517651" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="bc360cfbafc39970587547215162f0db" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="19e71144c50a8b9160b3f0955e906fce" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="21d4af9021a174f61b884606c74d9e42" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
    </documents>
</author>

I would like to read this XML file and convert it to a pandas DataFrame:

key                                         type     language    feature            web                         data
e95324a9a6c790ecb95e46cf15bE232ee517651      XXX        EN          xx      www.foo_bar_exmaple.com     A large text with lots of strings and punctuations symbols [...]
bc360cfbafc39970587547215162f0db             XXX        EN          xx      www.foo_bar_exmaple.com     A large text with lots of strings and punctuations symbols [...]
19e71144c50a8b9160b3cvdf2324f0955e906fce     XXX        EN          xx      www.foo_bar_exmaple.com     A large text with lots of strings and punctuations symbols [...]
21d4af9021a174f61b8erf284606c74d9e42         XXX        EN          xx      www.foo_bar_exmaple.com     A large text with lots of strings and punctuations symbols [...]

This is what I already tried, but I am getting some errors and probably there is a more efficient way of doing this task:

from lxml import objectify
import pandas as pd

path = 'file_path'
xml = objectify.parse(open(path))
root = xml.getroot()
root.getchildren()[0].getchildren()
df = pd.DataFrame(columns=('key','type', 'language', 'feature', 'web', 'data'))

for i in range(0,len(xml)):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['key','type', 'language', 'feature', 'web', 'data'], [obj[0].text, obj[1].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)

Could anybody suggest a better approach to this problem?

End genocide - save Gaza · Accepted Answer · 2021-04-21 07:52:11Z

61

You can use xml.etree.ElementTree (from the Python standard library) to build a pandas.DataFrame. Here's what I would do (when reading from a file, replace xml_data with the name of your file or a file object):

import pandas as pd
import xml.etree.ElementTree as ET
import io

def iter_docs(author):
    author_attr = author.attrib
    for doc in author.iter('document'):
        doc_dict = author_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['data'] = doc.text
        yield doc_dict

xml_data = io.StringIO(u'''YOUR XML STRING HERE''')

etree = ET.parse(xml_data) #create an ElementTree object 
doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))

If there are multiple authors in your original document or the root of your XML is not an author, then I would add the following generator:

def iter_author(etree):
    for author in etree.iter('author'):
        for row in iter_docs(author):
            yield row

and change doc_df = pd.DataFrame(list(iter_docs(etree.getroot()))) to doc_df = pd.DataFrame(list(iter_author(etree)))

Have a look at the ElementTree tutorial provided in the xml library documentation.
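
For reference, here is a minimal end-to-end sketch of the approach above, run against a shortened stand-in for the question's XML. The SAMPLE_XML string, the whitespace stripping, and the final rename/reorder step are my own assumptions added for illustration, not part of the original answer:

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Shortened, hypothetical stand-in for the XML in the question.
SAMPLE_XML = """<author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
    <documents count="2">
        <document KEY="e95a9a6c790ecb95e46cf15bee517651" web="www.foo_bar_exmaple.com"><![CDATA[First text
]]>
        </document>
        <document KEY="bc360cfbafc39970587547215162f0db" web="www.foo_bar_exmaple.com"><![CDATA[Second text
]]>
        </document>
    </documents>
</author>"""


def iter_docs(author):
    """Yield one dict per <document>, merging author and document attributes."""
    for doc in author.iter('document'):
        row = dict(author.attrib)
        row.update(doc.attrib)  # the document's own 'web' overrides the author's
        row['data'] = (doc.text or '').strip()  # drop the trailing newlines
        yield row


root = ET.parse(io.StringIO(SAMPLE_XML)).getroot()
doc_df = pd.DataFrame(list(iter_docs(root)))

# Rename KEY and keep only the columns the question asked for, in that order.
doc_df = doc_df.rename(columns={'KEY': 'key'})
doc_df = doc_df[['key', 'type', 'language', 'feature', 'web', 'data']]
print(doc_df)
```

Note that the document-level web attribute wins over the author-level one because doc.attrib is merged last; swap the order of the two updates if you want the opposite.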

edited Apr 21, 2021 at 7:52

End genocide - save Gaza

answered Feb 1, 2015 at 20:08

JaminSore

4 Comments

birgersp Over a year ago

Question was about how to load a xml file, so ideally your answer should have addressed that, rather than loading a xml string...

JaminSore Over a year ago

@gromit190 Good point. I've updated my answer for reading from a file.

Cristian Ciupitu Over a year ago

Tiny nitpick: xml_data = io.StringIO(''' needs to be replaced with xml_data = io.StringIO(u''' because the parameter for StringIO needs to be unicode. Otherwise you get "TypeError: initial_value must be unicode or None, not str".

JaminSore Over a year ago

@CristianCiupitu I see the question is tagged python-2.7 ---u prefix has been added.

Dharman · Accepted Answer · 2025-05-15 11:57:46Z

33

As of pandas v1.3, you can simply use:

pandas.read_xml(path_or_file)
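
A hedged sketch of how this might look for XML shaped like the question's. The SAMPLE_XML string is a shortened stand-in, and the xpath and parser arguments are my choices for illustration: the default xpath only looks at the root's direct children, so the row elements have to be pointed at explicitly, and parser="etree" avoids the lxml dependency of the default parser. As far as I can tell, read_xml only picks up the attributes and text of the selected nodes, so the author-level attributes would still need to be added separately:

```python
import io

import pandas as pd

# Shortened, hypothetical stand-in for the XML in the question.
SAMPLE_XML = """<author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
    <documents count="2">
        <document KEY="e95a9a6c790ecb95e46cf15bee517651" web="www.foo_bar_exmaple.com"><![CDATA[First text
]]>
        </document>
        <document KEY="bc360cfbafc39970587547215162f0db" web="www.foo_bar_exmaple.com"><![CDATA[Second text
]]>
        </document>
    </documents>
</author>"""

# Each <document> element becomes one row; its attributes become columns.
# parser="etree" uses the standard library instead of lxml.
df = pd.read_xml(io.StringIO(SAMPLE_XML), xpath=".//document", parser="etree")
print(df.columns.tolist())
print(df)
```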

edited May 15 at 11:57

Dharman♦

answered Mar 25, 2021 at 8:24

End genocide - save Gaza

1 Comment

Parfait Over a year ago

Actually, for the XML in this specific post, OP needs to adjust the XPath to look one level deeper than the default (the root's children): pandas.read_xml(path_or_file, xpath="/author/documents/document")

Davide Fiocco · Accepted Answer · 2019-05-21 17:01:10Z

16

Here is another way of converting XML to a pandas DataFrame. In this example I parse the XML from a string, but the same logic works when reading from a file.

import pandas as pd
import xml.etree.ElementTree as ET

xml_str = '<?xml version="1.0" encoding="utf-8"?>\n<response>\n <head>\n  <code>\n   200\n  </code>\n </head>\n <body>\n  <data id="0" name="All Categories" t="2018052600" tg="1" type="category"/>\n  <data id="13" name="RealEstate.com.au [H]" t="2018052600" tg="1" type="publication"/>\n </body>\n</response>'

etree = ET.fromstring(xml_str)
dfcols = ['id', 'name']

# DataFrame.append was removed in pandas 2.0, so collect the rows first
# and build the DataFrame in a single call.
rows = [[i.get('id'), i.get('name')] for i in etree.iter(tag='data')]
df = pd.DataFrame(rows, columns=dfcols)

df.head()

edited May 21, 2019 at 17:01

Davide Fiocco

answered May 29, 2018 at 6:57

Jai Prakash

End genocide - save Gaza · Accepted Answer · 2021-03-25 08:28:45Z

4

Chiming in to recommend the use of the xmltodict library. It handled your xml text pretty well and I've used it for ingesting an xml file with almost a million records.
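
A rough sketch of what that could look like for the question's XML, assuming xmltodict is installed (pip install xmltodict). The SAMPLE_XML string is a shortened stand-in, and the flattening loop is my own illustration rather than something the library does for you. xmltodict prefixes attribute names with '@' and stores an element's text under '#text':

```python
import pandas as pd
import xmltodict  # third-party package, not in the standard library

# Shortened, hypothetical stand-in for the XML in the question.
SAMPLE_XML = """<author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
    <documents count="2">
        <document KEY="e95a9a6c790ecb95e46cf15bee517651" web="www.foo_bar_exmaple.com"><![CDATA[First text
]]>
        </document>
        <document KEY="bc360cfbafc39970587547215162f0db" web="www.foo_bar_exmaple.com"><![CDATA[Second text
]]>
        </document>
    </documents>
</author>"""

# force_list keeps 'document' a list even if the file holds a single document.
parsed = xmltodict.parse(SAMPLE_XML, force_list=('document',))
author = parsed['author']

# Author-level attributes, with the '@' prefix removed.
author_attrs = {k.lstrip('@'): v for k, v in author.items() if k.startswith('@')}

rows = []
for doc in author['documents']['document']:
    row = dict(author_attrs)
    # Document-level attributes override author-level ones (e.g. 'web').
    row.update({k.lstrip('@'): v for k, v in doc.items() if k.startswith('@')})
    row['data'] = doc.get('#text')
    rows.append(row)

df = pd.DataFrame(rows)
print(df)
```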

edited Mar 25, 2021 at 8:28

End genocide - save Gaza

answered Aug 29, 2020 at 2:32

janoulle

Naveen Kaushik · Accepted Answer · 2019-04-23 03:41:22Z

You can also convert by creating a dictionary of elements and then directly converting to a data frame:

import xml.etree.ElementTree as ET
import pandas as pd

# Contents of test.xml:
# <?xml version="1.0" encoding="utf-8"?>
# <tags>
#   <row Id="1" TagName="bayesian" Count="4699" ExcerptPostId="20258" WikiPostId="20257" />
#   <row Id="2" TagName="prior" Count="598" ExcerptPostId="62158" WikiPostId="62157" />
#   <row Id="3" TagName="elicitation" Count="10" />
#   <row Id="5" TagName="open-source" Count="16" />
# </tags>

root = ET.parse('test.xml').getroot()

tags = {"tags":[]}
for elem in root:
    tag = {}
    tag["Id"] = elem.attrib['Id']
    tag["TagName"] = elem.attrib['TagName']
    tag["Count"] = elem.attrib['Count']
    tags["tags"].append(tag)

df_users = pd.DataFrame(tags["tags"])
df_users.head()
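
If you want every attribute rather than a hand-picked few, a shorter variant is to pass the attribute dicts straight to the DataFrame constructor. This is only a sketch: the XML is inlined as a string (XML_TEXT is my own name) so that it runs without test.xml, and the int conversion at the end is an optional extra. Rows that lack an attribute get NaN in that column:

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Same rows as the test.xml shown above, inlined so the sketch is self-contained.
XML_TEXT = """<tags>
  <row Id="1" TagName="bayesian" Count="4699" ExcerptPostId="20258" WikiPostId="20257" />
  <row Id="2" TagName="prior" Count="598" ExcerptPostId="62158" WikiPostId="62157" />
  <row Id="3" TagName="elicitation" Count="10" />
  <row Id="5" TagName="open-source" Count="16" />
</tags>"""

root = ET.fromstring(XML_TEXT)

# One dict of attributes per <row>; missing attributes become NaN.
df_tags = pd.DataFrame([dict(elem.attrib) for elem in root.iter('row')])

# Attribute values are parsed as strings, so convert numeric columns explicitly.
df_tags['Count'] = df_tags['Count'].astype(int)
print(df_tags)
```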
