Parse deeply nested XML to pandas dataframe

Question

I'm trying to extract particular parts of an XML file and load them into a pandas DataFrame. I followed some xml.etree tutorials, but I'm still stuck on getting the output. So far I've managed to find the child nodes, but I can't get the actual data out of them. Here is what I have so far:

import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
root = tree.getroot()
root.tag
# find all nodes within the XML data
tree.findall(".//")
# access the node
tree.findall(".//{http://someUrl.nl/schema/enterprise/program}programSummaryText")

What I want is the data from the programDescriptions node, specifically its child programDescriptionText xml:lang="nl". I will need a couple of other fields as well, but let's focus on this one first.

Some data to work with:

<?xml version="1.0" encoding="UTF-8"?>
<programs xmlns="http://someUrl.nl/schema/enterprise/program">
<program xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://someUrl.nl/schema/enterprise/program http://someUrl.nl/schema/enterprise/program.xsd">
<customizableOnRequest>true</customizableOnRequest>
<editor>webmaster@url</editor>
<expires>2019-04-21</expires>
<format>Edu-dex 1.0</format>
<generator>www.Url.com</generator>
<includeInCatalog>Catalogs</includeInCatalog>
<inPublication>true</inPublication>
<lastEdited>2019-04-12T20:03:09Z</lastEdited>
<programAdmission>
    <applicationOpen>true</applicationOpen>
    <applicationType>individual</applicationType>
    <maxNumberOfParticipants>12</maxNumberOfParticipants>
    <minNumberOfParticipants>8</minNumberOfParticipants>
    <paymentDue>up-front</paymentDue>
    <requiredLevel>academic bachelor</requiredLevel>
    <startDateDetermination>fixed starting date</startDateDetermination>
</programAdmission>
<programCurriculum>
    <instructionMode>training</instructionMode>
    <teacher>
        <id>{D83FFC12-0863-44A6-BDBB-ED618627F09D}</id>
        <name>SomeName</name>
        <summary xml:lang="nl">
        Long text of the summary. Not needed.
        </summary>
    </teacher>
    <studyLoad period="hour">26</studyLoad>
</programCurriculum>
<programDescriptions>
    <programName xml:lang="nl">Program Course Name</programName>
    <programSummaryText xml:lang="nl">short Program Course Name summary</programSummaryText>
    <programSummaryHtml xml:lang="nl">short Program Course Name summary in HTML format</programSummaryHtml>
    <programDescriptionText xml:lang="nl">This part is needed from the XML.
        Big program description text. This part is needed to parse from the XML file.
    </programDescriptionText>
    <programDescriptionHtml xml:lang="nl">Not needed;
        Not needed as well;
    </programDescriptionHtml>
    <subjectText>
        <subject>curriculum</subject>
        <header1 xml:lang="nl">Beschrijving</header1>
        <descriptionHtml xml:lang="nl">Yet another HTML description;
            Not necessarily needed;</descriptionHtml>
    </subjectText>
    <searchword xml:lang="nl">search word</searchword>
    <webLink xml:lang="nl">website-url</webLink>
</programDescriptions>
<programSchedule>
    <programRun>
        <id>PR-019514</id>
        <status>application opened</status>
        <startDate isFinal="true">2019-06-26</startDate>
        <endDate isFinal="true">2020-02-11</endDate>
    </programRun>
</programSchedule>
</program>
</programs>

The XML in the post is not valid: the element descriptionHtml is not closed. Please provide valid XML.
– balderman, Commented Apr 16, 2019 at 8:48
What is the text that you want to collect from this XML? Is it "short Program Course Name summary"?
– balderman, Commented Apr 16, 2019 at 9:26
Apparently text between < > is not displayed. I've edited my question once more; to repeat it here, the element I need is programDescriptionText xml:lang="nl".
– Wokkel, Commented Apr 16, 2019 at 9:41

balderman · Accepted Answer · 2019-04-16 10:01:41Z

Try the code below (55703748.xml contains the XML you posted):

import xml.etree.ElementTree as ET

tree = ET.parse('55703748.xml')
root = tree.getroot()
nodes = root.findall(".//{http://someUrl.nl/schema/enterprise/program}programSummaryText")
for node in nodes:
    print(node.text)

Output

short Program Course Name summary
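
The snippet above prints programSummaryText, while the question asks for programDescriptionText with xml:lang="nl". The same approach should carry over. Below is a minimal sketch, assuming the structure of the XML from the question; the inline XML is a trimmed copy so the example runs on its own, and the names NS, XML_LANG and the prefix "p" are illustrative choices, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question, inlined so the sketch is self-contained.
XML = """
<programs xmlns="http://someUrl.nl/schema/enterprise/program">
<program>
<programDescriptions>
    <programName xml:lang="nl">Program Course Name</programName>
    <programSummaryText xml:lang="nl">short Program Course Name summary</programSummaryText>
    <programDescriptionText xml:lang="nl">This part is needed from the XML.
        Big program description text. This part is needed to parse from the XML file.
    </programDescriptionText>
</programDescriptions>
</program>
</programs>
"""

# Map a short prefix (arbitrary, chosen here) to the document's default namespace.
NS = {"p": "http://someUrl.nl/schema/enterprise/program"}
# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

root = ET.fromstring(XML)

descriptions = []
for node in root.findall(".//p:programDescriptionText", NS):
    # Keep only the Dutch variant of the description.
    if node.get(XML_LANG) == "nl":
        # Collapse the line breaks and indentation inside the element text.
        descriptions.append(" ".join(node.text.split()))

print(descriptions)
```

Passing a prefix map to findall avoids repeating the full namespace URI in every path, which tends to help once more than one element is needed.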

The final part, the for loop with node.text, did it. Thanks for helping me out.
– Wokkel, Commented over a year ago
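
To get from node.text to the pandas DataFrame mentioned in the title, one option is to build one dictionary per program element and hand the list to pandas. This is only a sketch under assumptions: the inline XML is a trimmed copy of the question's data, and the helper text_of, the prefix "p" and the column names are hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Trimmed copy of the XML from the question, inlined so the sketch is self-contained.
XML = """
<programs xmlns="http://someUrl.nl/schema/enterprise/program">
<program>
<programDescriptions>
    <programName xml:lang="nl">Program Course Name</programName>
    <programDescriptionText xml:lang="nl">This part is needed from the XML.
        Big program description text.
    </programDescriptionText>
</programDescriptions>
<programSchedule>
    <programRun>
        <id>PR-019514</id>
        <startDate isFinal="true">2019-06-26</startDate>
    </programRun>
</programSchedule>
</program>
</programs>
"""

# Arbitrary prefix mapped to the document's default namespace.
NS = {"p": "http://someUrl.nl/schema/enterprise/program"}


def text_of(parent, path):
    """Return the whitespace-normalised text of the first match, or None."""
    node = parent.find(path, NS)
    if node is None or node.text is None:
        return None
    return " ".join(node.text.split())


root = ET.fromstring(XML)

# One row per <program>; each column is pulled from a nested path.
rows = []
for program in root.findall("p:program", NS):
    rows.append({
        "name": text_of(program, "p:programDescriptions/p:programName"),
        "description": text_of(program, "p:programDescriptions/p:programDescriptionText"),
        "start_date": text_of(program, "p:programSchedule/p:programRun/p:startDate"),
    })

df = pd.DataFrame(rows)
print(df)
```

Returning None for a missing element keeps the rows aligned, so a program without, say, a schedule would show up as a missing value in that column rather than raising an error.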
