I am new to using pandas with XML data and can't figure out how to convert an XML file to a pandas DataFrame using the standard read_xml function. I tried the following code, but it does not pick up the data fields:

import pandas as pd

xml='''
<TimeSeries xmlns="http://www.wldelft.nl/fews/PI" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.wldelft.nl/fews/PI http://fews.wldelft.nl/schemas/version1.0/pi-schemas/pi_timeseries.xsd" version="1.26" xmlns:fs="http://www.wldelft.nl/fews/fs">
    <timeZone>1.0</timeZone>
    <series>
        <header>
            <type>instantaneous</type>
            <moduleInstanceId>pr.pompvolumes</moduleInstanceId>
            <locationId>SL000246</locationId>
            <parameterId>Q.B.d</parameterId>
            <timeStep unit="second" multiplier="86400"/>
            <startDate date="2018-01-01" time="00:00:00"/>
            <endDate date="2022-01-01" time="00:00:00"/>
            <missVal>NaN</missVal>
            <stationName>Putten vijzel</stationName>
            <lat>52.263570497449855</lat>
            <lon>5.495717667656339</lon>
            <x>162408.0</x>
            <y>475066.0</y>
            <units>m3/s</units>
        </header>
        <event date="2018-01-01" time="00:00:00" value="1.262" flag="0"/>
        <event date="2018-01-02" time="00:00:00" value="1.456" flag="0"/>
        <event date="2018-01-03" time="00:00:00" value="0.845" flag="0"/>
        <event date="2018-01-04" time="00:00:00" value="1.507" flag="0"/>
        <event date="2018-01-05" time="00:00:00" value="1.083" flag="0"/>
        <event date="2018-01-06" time="00:00:00" value="0.516" flag="0"/>
    </series>
</TimeSeries>
'''

df = pd.read_xml(xml)

The resulting dataframe should have a format such as:

data = [['2018-01-01', 1.262, 0], ['2018-01-02', 1.456, 0], ['2018-01-03', 0.845, 0]]
df = pd.DataFrame(data, columns=['event date', 'value', 'flag' ])

Any help greatly appreciated!

  • Your XML contains a default namespace which you must acknowledge or nothing will be parsed. Commented Sep 19, 2022 at 14:32
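
To illustrate the point made in this comment, here is a minimal sketch that uses Python's standard xml.etree.ElementTree instead of pandas. It works on a trimmed copy of the XML (header omitted, three events only), and the prefix doc is an arbitrary name chosen for the example. A path without the namespace matches nothing, while the same path with the namespace mapped to a prefix finds the events.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question (assumption: header left out for brevity).
xml = """<TimeSeries xmlns="http://www.wldelft.nl/fews/PI">
    <series>
        <event date="2018-01-01" time="00:00:00" value="1.262" flag="0"/>
        <event date="2018-01-02" time="00:00:00" value="1.456" flag="0"/>
        <event date="2018-01-03" time="00:00:00" value="0.845" flag="0"/>
    </series>
</TimeSeries>"""

root = ET.fromstring(xml)

# Without the namespace, the element names do not match anything.
unqualified = root.findall("series/event")

# Map an arbitrary prefix ("doc") to the default namespace URI and use it in the path.
ns = {"doc": "http://www.wldelft.nl/fews/PI"}
qualified = root.findall("doc:series/doc:event", ns)

print(len(unqualified), len(qualified))
```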

1 Answer

  • Use pd.read_xml with a dict assigned to the namespaces parameter, where the key is a temporary namespace prefix of your choosing (e.g. doc) and the value is the namespace URI declared via xmlns (here: http://www.wldelft.nl/fews/PI).
  • The prefix is then used in the xpath to select the right nodes. Here: 'doc:series/doc:event'.
df = pd.read_xml(xml, xpath='doc:series/doc:event', 
                 namespaces={'doc':'http://www.wldelft.nl/fews/PI'})

print(df)

         date      time  value  flag
0  2018-01-01  00:00:00  1.262     0
1  2018-01-02  00:00:00  1.456     0
2  2018-01-03  00:00:00  0.845     0
3  2018-01-04  00:00:00  1.507     0
4  2018-01-05  00:00:00  1.083     0
5  2018-01-06  00:00:00  0.516     0

# drop the `time` column
df = df.drop(columns='time')
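
On newer pandas versions (2.1 and later, as far as I know), passing a literal XML string to read_xml is deprecated, and wrapping the string in io.StringIO is the suggested replacement. The following is a sketch of the same approach under that assumption, again on a trimmed copy of the XML. parser="etree" is used here only so the example runs without lxml installed; the default lxml parser should accept the same xpath and namespaces arguments.

```python
from io import StringIO

import pandas as pd

# Trimmed copy of the XML from the question (assumption: header left out for brevity).
xml = """<TimeSeries xmlns="http://www.wldelft.nl/fews/PI">
    <series>
        <event date="2018-01-01" time="00:00:00" value="1.262" flag="0"/>
        <event date="2018-01-02" time="00:00:00" value="1.456" flag="0"/>
        <event date="2018-01-03" time="00:00:00" value="0.845" flag="0"/>
    </series>
</TimeSeries>"""

df = pd.read_xml(
    StringIO(xml),  # file-like wrapper instead of a literal string
    xpath="doc:series/doc:event",
    namespaces={"doc": "http://www.wldelft.nl/fews/PI"},
    parser="etree",  # stdlib parser; chosen only to avoid the lxml dependency
)

# Drop the time column without modifying the frame in place.
df = df.drop(columns="time")
print(df)
```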

4 Comments

You can specify required column names as list in kwargs.
@Vovin: how do you do this? I mean, I know that you can pass names for the columns, but this will just rename the returned columns in order and then cut off the rest. E.g. in this case, if we do not pass anything, we get ['date','time','value','flag']. If you pass names=['date','value'], this will return the values from ['date','time'], but merely renamed: ['date','value'].
I am sorry. I was inattentive, you are right.
@Vovin: no worries, would be a nice feature, surely!
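
Following up on the comments above: one way to keep only some attributes is to select them by name after parsing. A minimal sketch, where df is a hand-built stand-in for the frame returned by read_xml in the answer (first three rows only) and wanted is a hypothetical variable name:

```python
import pandas as pd

# Stand-in for the frame returned by read_xml in the answer (first three rows only).
df = pd.DataFrame(
    {
        "date": ["2018-01-01", "2018-01-02", "2018-01-03"],
        "time": ["00:00:00", "00:00:00", "00:00:00"],
        "value": [1.262, 1.456, 0.845],
        "flag": [0, 0, 0],
    }
)

# Select columns by name, then rename to match the layout asked for in the question.
wanted = df[["date", "value", "flag"]].rename(columns={"date": "event date"})
print(wanted)
```

Selecting by name keeps each value under its own attribute, so nothing is relabeled by position.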
