I'm trying to fetch particular parts of an XML file and move them into a pandas DataFrame. After following some xml.etree tutorials, I'm still stuck on getting the output. So far I've managed to find the child nodes, but I can't access them (i.e., I can't get the actual data out of them). Here is what I've got so far:

import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
root = tree.getroot()
root.tag
# find all nodes within the XML data
tree.findall(".//*")
# access the node
tree.findall(".//{http://someUrl.nl/schema/enterprise/program}programSummaryText")

What I want is the data from the programDescriptions node, specifically its child programDescriptionText xml:lang="nl", plus a couple of other elements later. But let's focus on this one first.

Some data to work with:

<?xml version="1.0" encoding="UTF-8"?>
<programs xmlns="http://someUrl.nl/schema/enterprise/program">
<program xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://someUrl.nl/schema/enterprise/program http://someUrl.nl/schema/enterprise/program.xsd">
<customizableOnRequest>true</customizableOnRequest>
<editor>webmaster@url</editor>
<expires>2019-04-21</expires>
<format>Edu-dex 1.0</format>
<generator>www.Url.com</generator>
<includeInCatalog>Catalogs</includeInCatalog>
<inPublication>true</inPublication>
<lastEdited>2019-04-12T20:03:09Z</lastEdited>
<programAdmission>
    <applicationOpen>true</applicationOpen>
    <applicationType>individual</applicationType>
    <maxNumberOfParticipants>12</maxNumberOfParticipants>
    <minNumberOfParticipants>8</minNumberOfParticipants>
    <paymentDue>up-front</paymentDue>
    <requiredLevel>academic bachelor</requiredLevel>
    <startDateDetermination>fixed starting date</startDateDetermination>
</programAdmission>
<programCurriculum>
    <instructionMode>training</instructionMode>
    <teacher>
        <id>{D83FFC12-0863-44A6-BDBB-ED618627F09D}</id>
        <name>SomeName</name>
        <summary xml:lang="nl">
        Long text of the summary. Not needed.
        </summary>
    </teacher>
    <studyLoad period="hour">26</studyLoad>
</programCurriculum>
<programDescriptions>
    <programName xml:lang="nl">Program Course Name</programName>
    <programSummaryText xml:lang="nl">short Program Course Name summary</programSummaryText>
    <programSummaryHtml xml:lang="nl">short Program Course Name summary in HTML format</programSummaryHtml>
    <programDescriptionText xml:lang="nl">This part is needed from the XML.
        Big program description text. This part is needed to parse from the XML file.
    </programDescriptionText>
    <programDescriptionHtml xml:lang="nl">Not needed;
        Not needed as well;
    </programDescriptionHtml>
    <subjectText>
        <subject>curriculum</subject>
        <header1 xml:lang="nl">Beschrijving</header1>
        <descriptionHtml xml:lang="nl">Yet another HTML description;
            Not necessarily needed;</descriptionHtml>
    </subjectText>
    <searchword xml:lang="nl">search word</searchword>
    <webLink xml:lang="nl">website-url</webLink>
</programDescriptions>
<programSchedule>
    <programRun>
        <id>PR-019514</id>
        <status>application opened</status>
        <startDate isFinal="true">2019-06-26</startDate>
        <endDate isFinal="true">2020-02-11</endDate>
    </programRun>
</programSchedule>
</program>
</programs>
4 Comments
  • The XML in the post is not valid: the element descriptionHtml is not closed. Please provide valid XML. Commented Apr 16, 2019 at 8:48
  • My apologies. Found the problem and fixed it. Commented Apr 16, 2019 at 9:24
  • What is the text that you want to collect from this XML? Is it "short Program Course Name summary"? Commented Apr 16, 2019 at 9:26
  • Apparently the editor doesn't allow text between < >. I've edited my question once more, and to repeat it here: programDescriptionText xml:lang="nl" Commented Apr 16, 2019 at 9:41

1 Answer

Try the code below (55703748.xml contains the XML you posted):

import xml.etree.ElementTree as ET

tree = ET.parse('55703748.xml')
root = tree.getroot()
nodes = root.findall(".//{http://someUrl.nl/schema/enterprise/program}programSummaryText")
for node in nodes:
    print(node.text)

Output

short Program Course Name summary
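The same pattern should carry over to programDescriptionText, which is the element the question actually asks for. The sketch below is one possible variation, not the only way to do it: it passes a prefix-to-namespace mapping to findall instead of spelling out the full URI in braces, and it filters on xml:lang. The prefix `p`, the names `NS` and `XML_LANG`, and the trimmed inline XML are assumptions made for illustration; whitespace inside the text is also collapsed, which may or may not be wanted.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the posted XML, inlined so the sketch runs on its own.
XML = """<programs xmlns="http://someUrl.nl/schema/enterprise/program">
<program>
<programDescriptions>
    <programSummaryText xml:lang="nl">short Program Course Name summary</programSummaryText>
    <programDescriptionText xml:lang="nl">This part is needed from the XML.
        Big program description text.
    </programDescriptionText>
</programDescriptions>
</program>
</programs>"""

# Hypothetical prefix "p" mapped to the document's default namespace.
NS = {"p": "http://someUrl.nl/schema/enterprise/program"}
# ElementTree stores xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

root = ET.fromstring(XML)

# Keep only the Dutch descriptions and collapse runs of whitespace/newlines.
descriptions = [
    " ".join(node.text.split())
    for node in root.findall(".//p:programDescriptionText", NS)
    if node.get(XML_LANG) == "nl" and node.text
]

for text in descriptions:
    print(text)
```

With the real file, `ET.parse('55703748.xml').getroot()` would replace the `ET.fromstring(XML)` call.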

1 Comment

The final part, the for loop with node.text, did it. Thanks for helping me out.
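For the pandas part of the question, one possible approach is to build one dictionary per program element and hand the resulting list to pandas afterwards. This is a rough sketch under assumptions: the column names, the `FIELDS` mapping, the `program_rows` helper, and the shortened inline XML are all made up for illustration, and only the standard library is used so the block runs without pandas installed.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the posted XML so the sketch is self-contained.
XML = """<programs xmlns="http://someUrl.nl/schema/enterprise/program">
<program>
<expires>2019-04-21</expires>
<programDescriptions>
    <programName xml:lang="nl">Program Course Name</programName>
    <programDescriptionText xml:lang="nl">This part is needed from the XML.</programDescriptionText>
</programDescriptions>
</program>
</programs>"""

NS = {"p": "http://someUrl.nl/schema/enterprise/program"}

# Hypothetical column name -> path relative to each <program> element.
FIELDS = {
    "expires": "p:expires",
    "name": "p:programDescriptions/p:programName",
    "description": "p:programDescriptions/p:programDescriptionText",
}


def program_rows(root):
    """Return one dict per <program>, with None for missing elements."""
    rows = []
    for program in root.findall("p:program", NS):
        row = {}
        for column, path in FIELDS.items():
            text = program.findtext(path, default=None, namespaces=NS)
            # Collapse internal whitespace; leave missing values as None.
            row[column] = " ".join(text.split()) if text is not None else None
        rows.append(row)
    return rows


rows = program_rows(ET.fromstring(XML))
print(rows)
```

Passing the list to `pandas.DataFrame(rows)` should then give one row per program with the three columns above.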
