7

I have an XML file that looks like this:

<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>

What I'm trying to do is to extract ID, Text and CreationDate colums into pandas DF and I've tried following:

import xml.etree.cElementTree as et
import pandas as pd
path = '/.../...'
dfcols = ['ID', 'Text', 'CreationDate']
df_xml = pd.DataFrame(columns=dfcols)

root = et.parse(path)
rows = root.findall('.//row')
for row in rows:
    ID = row.find('Id')
    text = row.find('Text')
    date = row.find('CreationDate')
    print(ID, text, date)
    df_xml = df_xml.append(pd.Series([ID, text, date], index=dfcols), ignore_index=True)

print(df_xml)

But the output is:

None None None

How do I fix this?

0

4 Answers 4

4

As advised in this solution by gold member Python/pandas/numpy guru, @unutbu:

Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.

Therefore, consider parsing your XML data into a separate list then pass list into the DataFrame constructor in one call outside of any loop. In fact, you can pass nested lists with list comprehension directly into the constructor:

path = 'AttributesXMLPandas.xml'
dfcols = ['ID', 'Text', 'CreationDate']

root = et.parse(path)
rows = root.findall('.//row')

# NESTED LIST
xml_data = [[row.get('Id'), row.get('Text'), row.get('CreationDate')] 
            for row in rows]

df_xml = pd.DataFrame(xml_data, columns=dfcols)

print(df_xml)

#   ID   Text             CreationDate
# 0  1  (...)  2011-08-30T21:15:28.063
# 1  2  (...)  2011-08-30T21:24:56.573
# 2  3  (...)                     None
Sign up to request clarification or add additional context in comments.

Comments

3

Just a minor change in your code

ID = row.get('Id')
text = row.get('Text')
date = row.get('CreationDate')

Comments

3

Based on @Parfait solution, I wrote my version that gets the columns as a parameter and returns the Pandas DataFrame.

test.xml:

<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(.1.)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(.2.)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(.3.)" UserId="9" />
</comments>

xml_to_pandas.py:

'''Xml to Pandas DataFrame Convertor.'''

import xml.etree.cElementTree as et
import pandas as pd


def xml_to_pandas(root, columns, row_name):
  '''get xml.etree root, the columns and return Pandas DataFrame'''
  df = None
  try:

    rows = root.findall('.//{}'.format(row_name))

    xml_data = [[row.get(c) for c in columns] for row in rows]  # NESTED LIST

    df = pd.DataFrame(xml_data, columns=columns)
  except Exception as e:
    print('[xml_to_pandas] Exception: {}.'.format(e))

  return df


path = 'test.xml'
row_name = 'row'
columns = ['ID', 'Text', 'CreationDate']

root = et.parse(path)
df = xml_to_pandas(root, columns, row_name)
print(df)

output:

enter image description here

Comments

0

Since pandas 1.3.0, there's a built-in pandas function pd.read_xml that reads XML documents into a pandas DataFrame.

path = """<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>"""

# or a path to an XML doc
path = 'test.xml'
pd.read_xml(path)

The XML doc in the OP becomes the following by simply calling read_xml:

res

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.