I am creating a simple web scraper, but I am having trouble handling XML data properly. After parsing a response into an XML element, I find that the element does not contain any child nodes, although I expected it to. Am I missing something obvious here?

My code:

import xml.etree.ElementTree as ET
import requests

# Urllog, Urlcourses, payload and formdata are set earlier (not shown)
with requests.Session() as s:
    s.post(Urllog, data=payload)
    x = s.post(Urlcourses, data=formdata)
    root = ET.fromstring(x.content)
    print(x.content)

A few examples of the element having no children:

>>> root.tag
'contents'
>>>
>>> for child in root:
...     print(child.tag)  # prints nothing
...
>>>

>>> root[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: child index out of range
>>>

x.content is as expected, as follows:

<?xml version="1.0"?>
<contents><![CDATA[ 
<!-- Display system announcements -->
  <div class="noItems divider">No Institution Announcements have been posted in the last 7 days.</div>
      <!-- Display course/org announcements -->
  <h3>xxx (S2 2015)</h3>
          <div class="courseDataBlock">
        <ul>
          <li><a
              href=xxx
            > Lecture Recordings + Tutorial Sheet</a></li>
          </ul>
        </div>
        <h3>xxx (S2 2015)</h3>
          <div class="courseDataBlock">
        <ul>
          <li><a
              href=xxx
            > Tutorials / consultation hours</a></li>
          <li><a
              href=xxx
            > 2014 lectures uploaded</a></li>
          </ul>
        </div>
        <h3>xxx(S2 2015)</h3>
          <div class="courseDataBlock">
        <ul>
          <li><a
              href=xxx
            > PASS - Peer Assisted Study Sessions</a></li>
          </ul>
        </div>
        <h3>xxxx</h3>
          <div class="courseDataBlock">
        <ul>
          <li><a
              href=xxxx2_1"
            > xxx!</a></li>
          <li><a
              href=xxx
            > Careers for Engineers: A session from Engineers Australia</a></li>
          </ul>
        </div>
        <div class="moduleControlWrapper u_reverseAlign">
    <a class="button-6"
        href=xxxx
      >more announcements...</a>
    </div>



                 ]]></contents>

1 Answer

The element you got in root is actually correct. If you check your XML, it starts with:

<contents><![CDATA[ 

It has only one node, contents, and everything else is CDATA text inside that node, not child elements.
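As a minimal sketch of this behaviour, the snippet below parses a shortened, made-up stand-in for the response (the `sample` bytes are an assumption, not the real payload) and shows that the markup inside the CDATA section does not produce child elements:

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the response body in the question.
sample = b"""<?xml version="1.0"?>
<contents><![CDATA[
<h3>Course A</h3>
<div class="courseDataBlock"><ul><li><a href="#">Item</a></li></ul></div>
]]></contents>"""

root = ET.fromstring(sample)

child_count = len(root)  # CDATA content creates no child elements
cdata_text = root.text   # the markup arrives as one plain string

print(root.tag)
print(child_count)
print("<h3>Course A</h3>" in cdata_text)
```

The parser treats everything between `<![CDATA[` and `]]>` as character data, so the tags inside it are never interpreted as elements.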

You can access that text using root.text. Also, the text does not seem to be well-formed XML (the href attribute values are unquoted, for example), so you may want to parse it with an HTML parsing library such as BeautifulSoup rather than with xml.etree.ElementTree.
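A rough sketch of that second step follows. To keep it dependency-free it uses the standard library's `html.parser` in place of BeautifulSoup, and the `LinkCollector` class and the `html_text` sample are hypothetical names and data made up for illustration; in practice `html_text` would be `root.text`.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (heading, href, link text) triples from loosely formed HTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._heading = None   # text of the most recent <h3>
        self._href = None      # href of the <a> currently open
        self._open_tag = None  # "h3" or "a" while inside one of them
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "a"):
            self._open_tag = tag
            self._buffer = []
            if tag == "a":
                self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._open_tag:
            return
        text = "".join(self._buffer).strip()
        if tag == "h3":
            self._heading = text
        else:
            self.links.append((self._heading, self._href, text))
        self._open_tag = None


# Made-up sample in the style of the CDATA text, including unquoted
# href values and a <div> that is never closed.
html_text = """
<h3>Course A (S2 2015)</h3>
<div class="courseDataBlock">
<ul>
  <li><a
      href=page1
    > Lecture Recordings</a></li>
</ul>
</div>
<h3>Course B (S2 2015)</h3>
<div class="courseDataBlock">
<ul>
  <li><a
      href=page2
    > Tutorials</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(html_text)
collector.close()

for heading, href, text in collector.links:
    print(heading, href, text)
```

If BeautifulSoup is installed, `BeautifulSoup(root.text, "html.parser")` should give a navigable tree with much less code; the standard-library version above is only meant to show that a lenient HTML parser copes with markup that a strict XML parser rejects.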

1 Comment

Thanks Anand! I used BeautifulSoup for this and it worked perfectly.
