Convert xml file to pandas dataframe

Question

I am new to using pandas with XML data, and I can't figure out how to convert an XML file to a pandas DataFrame using the standard read_xml function. I tried the following code, but it does not pick up the data fields:

import pandas as pd

xml='''
<TimeSeries xmlns="http://www.wldelft.nl/fews/PI" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.wldelft.nl/fews/PI http://fews.wldelft.nl/schemas/version1.0/pi-schemas/pi_timeseries.xsd" version="1.26" xmlns:fs="http://www.wldelft.nl/fews/fs">
    <timeZone>1.0</timeZone>
    <series>
        <header>
            <type>instantaneous</type>
            <moduleInstanceId>pr.pompvolumes</moduleInstanceId>
            <locationId>SL000246</locationId>
            <parameterId>Q.B.d</parameterId>
            <timeStep unit="second" multiplier="86400"/>
            <startDate date="2018-01-01" time="00:00:00"/>
            <endDate date="2022-01-01" time="00:00:00"/>
            <missVal>NaN</missVal>
            <stationName>Putten vijzel</stationName>
            <lat>52.263570497449855</lat>
            <lon>5.495717667656339</lon>
            <x>162408.0</x>
            <y>475066.0</y>
            <units>m3/s</units>
        </header>
        <event date="2018-01-01" time="00:00:00" value="1.262" flag="0"/>
        <event date="2018-01-02" time="00:00:00" value="1.456" flag="0"/>
        <event date="2018-01-03" time="00:00:00" value="0.845" flag="0"/>
        <event date="2018-01-04" time="00:00:00" value="1.507" flag="0"/>
        <event date="2018-01-05" time="00:00:00" value="1.083" flag="0"/>
        <event date="2018-01-06" time="00:00:00" value="0.516" flag="0"/>
        </series>
</TimeSeries>
'''

df = pd.read_xml(xml)

The resulting dataframe should have a format such as:

data = [['2018-01-01', 1.262, 0], ['2018-01-02', 1.456, 0], ['2018-01-03', 0.845, 0]]
df = pd.DataFrame(data, columns=['event date', 'value', 'flag' ])

Any help greatly appreciated!

Your XML contains a default namespace, which you must acknowledge or nothing will be parsed.
– Parfait, Commented Sep 19, 2022 at 14:32
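
To illustrate the point about the default namespace, here is a minimal sketch using only the standard library's xml.etree.ElementTree rather than pandas. The XML is a shortened copy of the document from the question, and the prefix doc is an arbitrary name chosen for this example; only the namespace URI has to match the document.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question (header and most events omitted).
xml = """<TimeSeries xmlns="http://www.wldelft.nl/fews/PI">
    <series>
        <event date="2018-01-01" time="00:00:00" value="1.262" flag="0"/>
        <event date="2018-01-02" time="00:00:00" value="1.456" flag="0"/>
    </series>
</TimeSeries>"""

root = ET.fromstring(xml)

# The default namespace becomes part of every element's tag name.
print(root.tag)  # {http://www.wldelft.nl/fews/PI}TimeSeries

# An unprefixed path searches for tags in no namespace, so nothing matches.
unqualified = root.findall("series/event")

# Mapping a prefix ("doc" is an arbitrary choice) to the namespace URI
# lets the path address the namespaced elements.
ns = {"doc": "http://www.wldelft.nl/fews/PI"}
qualified = root.findall("doc:series/doc:event", ns)

print(len(unqualified), len(qualified))  # 0 2
```

The same mismatch is the likely reason a namespace-unaware xpath finds no event rows in this document.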

ouroboros1 · Accepted Answer · 2022-09-19 12:43:26Z

Use pd.read_xml and pass a dict to the namespaces parameter. The key is a temporary namespace prefix of your choosing (e.g. doc), and the value is the namespace URI declared by xmlns in the document (here: http://www.wldelft.nl/fews/PI).
The prefix is then used in xpath to qualify each element name. Here: 'doc:series/doc:event'.

df = pd.read_xml(xml, xpath='doc:series/doc:event',
                 namespaces={'doc': 'http://www.wldelft.nl/fews/PI'})

print(df)

         date      time  value  flag
0  2018-01-01  00:00:00  1.262     0
1  2018-01-02  00:00:00  1.456     0
2  2018-01-03  00:00:00  0.845     0
3  2018-01-04  00:00:00  1.507     0
4  2018-01-05  00:00:00  1.083     0
5  2018-01-06  00:00:00  0.516     0

# drop the `time` column
df = df.drop(columns='time')
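
For reference, the steps above can be combined into one self-contained sketch that ends with the three columns asked for in the question. Several things here are assumptions and not part of the original answer: the XML is a shortened copy of the question's document, the string is wrapped in StringIO because recent pandas releases deprecate passing literal XML text to read_xml, and parser="etree" is chosen only so the example does not require lxml to be installed.

```python
from io import StringIO

import pandas as pd

# Shortened copy of the XML from the question (header and later events omitted).
xml = """<TimeSeries xmlns="http://www.wldelft.nl/fews/PI">
    <series>
        <event date="2018-01-01" time="00:00:00" value="1.262" flag="0"/>
        <event date="2018-01-02" time="00:00:00" value="1.456" flag="0"/>
        <event date="2018-01-03" time="00:00:00" value="0.845" flag="0"/>
    </series>
</TimeSeries>"""

df = pd.read_xml(
    StringIO(xml),  # file-like wrapper instead of a literal string
    xpath="doc:series/doc:event",
    namespaces={"doc": "http://www.wldelft.nl/fews/PI"},
    parser="etree",  # standard-library parser; the default is lxml
)

# Convert the date strings to datetimes, then keep only the requested columns.
df["date"] = pd.to_datetime(df["date"])
result = df[["date", "value", "flag"]]

print(result)
```

Selecting columns after reading keeps the values aligned with their attribute names, which avoids the renaming issue with names discussed in the comments.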

edited Sep 19, 2022 at 12:43

answered Sep 19, 2022 at 12:38

ouroboros1

4 Comments

Vovin Over a year ago

You can specify required column names as list in kwargs.

ouroboros1 Over a year ago

@Vovin: how do you do this? I mean, I know that you can pass names for the columns, but this will just rename the returned columns in order and then cut off the rest. E.g. in this case, if we do not pass anything, we get ['date','time','value','flag']. If you pass names=['date','value'], this will return the values from ['date','time'], but merely renamed: ['date','value'].

Vovin Over a year ago

I am sorry. I was inattentive, you are right.

ouroboros1 Over a year ago

@Vovin: no worries, would be a nice feature, surely!
