Handling XML data from python requests

Question

I am creating a simple web scraper, but I am having trouble handling XML data properly. Specifically, after creating an XML element, I find that the element does not contain any child nodes, although I expected it to. Am I missing something obvious here?

My code:

import xml.etree.ElementTree as ET
import requests

# Urllog, Urlcourses, payload and formdata are defined elsewhere (not shown)
with requests.Session() as s:
    s.post(Urllog, data=payload)
    x = s.post(Urlcourses, data=formdata)
    root = ET.fromstring(x.content)
    print(x.content)

A few examples of the element having no children:

>>> root.tag
'contents'
>>>
>>> for child in root:
...     print(child.tag)  # prints nothing
...
>>>

>>> root[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: child index out of range
>>>

x.content is as expected, as follows:

    <?xml version="1.0"?>
<contents><![CDATA[ 
<!-- Display system announcements -->
  <div class="noItems divider">No Institution Announcements have been posted in the last 7 days.</div>
      <!-- Display course/org announcements -->
  <h3>xxx (S2 2015)</h3>
          <div class="courseDataBlock">
        <ul>
          <li><a
              href=xxx
            > Lecture Recordings + Tutorial Sheet</a></li>
          </ul>
        </div>
        <h3>xxx (S2 2015)</h3>
          <div class="courseDataBlock">
        <ul>
          <li><a
              href=xxx
            > Tutorials / consultation hours</a></li>
          <li><a
              href=xxx
            > 2014 lectures uploaded</a></li>
          </ul>
        </div>
        <h3>xxx(S2 2015)</h3>
          <div class="courseDataBlock">
        <ul>
          <li><a
              href=xxx
            > PASS - Peer Assisted Study Sessions</a></li>
          </ul>
        </div>
        <h3>xxxx</h3>
          <div class="courseDataBlock">
        <ul>
          <li><a
              href=xxxx2_1"
            > xxx!</a></li>
          <li><a
              href=xxx
            > Careers for Engineers: A session from Engineers Australia</a></li>
          </ul>
        </div>
        <div class="moduleControlWrapper u_reverseAlign">
    <a class="button-6"
        href=xxxx
      >more announcements...</a>
    </div>



                 ]]></contents>

Anand S Kumar · Accepted Answer · 2015-08-05 06:37:20Z

The XML you parsed into root is actually correct. If you check your XML, it starts with:

<contents><![CDATA[

It has only one node, contents, and everything else is CDATA text inside that node rather than child elements.
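A minimal sketch of this behaviour, assuming a shortened, hypothetical stand-in (`sample`) for the response body shown in the question:

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for x.content from the question.
sample = b"""<?xml version="1.0"?>
<contents><![CDATA[
<h3>xxx (S2 2015)</h3>
<div class="courseDataBlock"><ul><li><a href=xxx> Lecture Recordings</a></li></ul></div>
]]></contents>"""

root = ET.fromstring(sample)

print(root.tag)   # the only element in the document
print(len(root))  # number of child elements; CDATA is character data, not elements

# The markup inside the CDATA section arrives as one plain string.
html_fragment = root.text
print(html_fragment.strip())
```

The parser treats everything between `<![CDATA[` and `]]>` as text, so the tags inside it are never turned into elements.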

You can access that text using root.text. Also, the text does not seem to be actual XML, since it appears to contain an unclosed <div> tag, so you may want to parse it with an HTML parsing library such as BeautifulSoup rather than with xml.etree.ElementTree.
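As a rough illustration of parsing that text as HTML, here is a sketch that uses the standard library's `html.parser` instead of BeautifulSoup, only so that it runs without extra installs; BeautifulSoup would likely be shorter. The `LinkCollector` class and the `html_fragment` sample are hypothetical names introduced for this example, and the fragment is a shortened copy of the kind of text found in root.text.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical, shortened stand-in for root.text.
html_fragment = """
<h3>xxx (S2 2015)</h3>
<div class="courseDataBlock">
  <ul>
    <li><a
        href=xxx
      > Lecture Recordings + Tutorial Sheet</a></li>
  </ul>
</div>
"""

collector = LinkCollector()
collector.feed(html_fragment)
collector.close()
print(collector.links)
```

An HTML parser is tolerant of things an XML parser rejects, such as the unquoted href values in this fragment.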

edited Aug 5, 2015 at 6:37

answered Aug 5, 2015 at 6:09


1 Comment

user3636636 Over a year ago

Thanks Anand! I used Beautiful Soup for this and it worked perfectly.
