Python XML parsing not working for some sites

Question

I have a very basic XML parser based on the tutorial provided here, for the purpose of reading RSS feeds in Python.

import urllib
from xml.dom import minidom, Node

def GetRSS(RSSurl):
    url_info = urllib.urlopen(RSSurl)
    if (url_info):
        xmldoc = minidom.parse(url_info)
    if (xmldoc):
        for item_node in xmldoc.documentElement.childNodes:
            if (item_node.nodeName == "item"):  
                PrintNodeItems(item_node, ["title","link"])
    else:
        print "error"

def PrintNodeItems(XmlNode, items):
    for item_node in XmlNode.childNodes:
        if item_node.nodeName in items:
            PrintNodesText(item_node)

def PrintNodesText(XmlNode):
    text = ""
    for text_node in XmlNode.childNodes:
        if(text_node.nodeType == Node.TEXT_NODE):
            text = text_node.nodeValue
    if (len(text)>0):
        print text
        print ""

I have tested the GetRSS function on the address provided in the tutorial (http://rss.slashdot.org/Slashdot/slashdot), and it works fine there, producing the correct output. However, my goal in writing this module was to read the RSS feed at RedLetterMedia (http://redlettermedia.com/feed/). When I call GetRSS on that address in the Python shell, I get a blank line instead of the expected results. I also tested it on CNN's "World" RSS feed and got no results there either. I have opened all three addresses with urllib.urlopen, and they all appear to use the same format for their nodes and child nodes (<item><title><description><link></item>).

I figure, as was the case for my previous question, there is probably something very obvious I am missing. Does anybody know what that is?

Edit: and for the record, my error message has not come up at all, but maybe that's because I integrated it into the code incorrectly; I would not put it beyond me.

Update: I rewrote the code from scratch using several answered questions on Stack Overflow. It works like a charm!

import urllib
from xml.dom import minidom

def GetRSS(RSSurl):
    url_info = urllib.urlopen(RSSurl)
    xmldoc = minidom.parse(url_info)
    # Search each level from its parent node rather than from xmldoc,
    # so that every item's link is printed exactly once.
    for channel in xmldoc.getElementsByTagName('channel'):
        for item in channel.getElementsByTagName('item'):
            for a in item.getElementsByTagName('link'):
                if a.firstChild is not None:
                    print a.firstChild.data


def main():
    GetRSS('http://redlettermedia.com/feed/')

Fred Foo · Accepted Answer · 2012-02-08 10:12:39Z

The error is here:

for item_node in xmldoc.documentElement.childNodes:
    if (item_node.nodeName == "item"):

There is no root item element, just a channel. I found this out by just printing all the values of nodeName in the loop.
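The debugging step described above can be sketched as follows. This is a minimal illustration, not the answerer's exact code: SAMPLE_FEED is a made-up RSS 2.0 document standing in for the real feed, child_node_names is a hypothetical helper, and the snippet is written for Python 3, unlike the Python 2 code in the question.

```python
from xml.dom import minidom

# Hypothetical stand-in for a downloaded RSS 2.0 feed (not the real one).
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com</link>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
    </item>
  </channel>
</rss>"""


def child_node_names(node):
    """Return the nodeName of every direct child of node."""
    return [child.nodeName for child in node.childNodes]


xmldoc = minidom.parseString(SAMPLE_FEED)
names = child_node_names(xmldoc.documentElement)
print(names)
```

For a feed shaped like this sample, the list contains channel (plus whitespace-only #text nodes) but no item, because the item elements sit one level deeper, inside channel.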

2 Comments

Jordan Over a year ago

So, I should replace "item" in that line with "channel"? I have just tried that, and it does return a result now:

>>> GetRSS('http://redlettermedia.com/feed')
http://redlettermedia.com

I suppose this is a step up from no response at all, but it is still not the response I was trying to receive. Any idea why I am getting this response? In the meantime I will examine all the nodes and my attempts to call them; maybe it is another case of trying to call something that isn't there, like the root item element.

Fred Foo Over a year ago

@Jordan: you'll want to look for <item>s within the <channel>. Consider using lxml instead of minidom; it has XPath support.
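A path-based lookup of the items inside the channel might look like the sketch below. To stay within the standard library it uses xml.etree.ElementTree, whose findall accepts a limited XPath subset, in place of lxml. SAMPLE_FEED and item_titles_and_links are made up for illustration, and the code targets Python 3 rather than the Python 2 used in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a downloaded RSS 2.0 feed (not the real one).
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com</link>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def item_titles_and_links(feed_xml):
    """Return a (title, link) tuple for each item inside the channel."""
    root = ET.fromstring(feed_xml)  # root is the <rss> element
    results = []
    # "channel/item" selects item elements that are children of channel,
    # so the channel's own title and link are not picked up.
    for item in root.findall("channel/item"):
        results.append((item.findtext("title"), item.findtext("link")))
    return results


for title, link in item_titles_and_links(SAMPLE_FEED):
    print(title)
    print(link)
```

Because the path is anchored at channel, only per-item values are returned, which avoids the stray channel-level link seen in the comment above.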
