I want to parse a Feedburner feed with HtmlUnit. The feed is this one: http://feeds.feedburner.com/alcoanewsreleases
From this feed I want to read all item nodes, so normally a //item XPath should do the trick. Unfortunately that does not work in this case.
Groovy code snippet:
def page = webClient.getPage("http://feeds.feedburner.com/alcoanewsreleases")
def elements = page.getByXPath("//item")
Sample of the XML feed:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss1full.xsl"?>
<?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://purl.org/rss/1.0/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
[...SNIP...]
<item rdf:about="http://www.alcoa.com/global/en/news/news_detail.asp?newsYear=2011&pageID=20110518006002en">
<title>Chris L. Ayers Named President, Alcoa Global Primary Products</title>
<dc:date>2011-05-18</dc:date>
<link>http://feedproxy.google.com/~r/alcoanewsreleases/~3/PawvdhpJrkc/news_detail.asp</link>
<description>NEW YORK--(BUSINESS WIRE)--Alcoa (NYSE:AA) announced today that Chris L. Ayers has been named President of Alcoa’s Global Primary Products (GPP) business, effective May 18, 2011. Ayers, previously Chief Operating Officer of GPP, succeeds John Thuestad, who will be handling special projects for the Company. Ayers joined Alcoa in February 2010 as Chief Operating Officer of Alcoa Cast, Forged and Extruded Products, a new position. He was elected a Vice President of Alcoa in April 2010 and Executive</description>
<feedburner:origLink xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">http://www.alcoa.com/global/en/news/news_detail.asp?newsYear=2010&pageID=20100104006194en</feedburner:origLink>
</item>
[...SNIP...]
</rdf:RDF>
I suspect this is a namespace issue, because the document declares four namespaces:
- (this is the default) xmlns="http://purl.org/rss/1.0/"
- xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
- xmlns:dc="http://purl.org/dc/elements/1.1/"
- xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0"
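To check that suspicion in isolation, here is a minimal sketch that uses the JDK's own XPath API rather than HtmlUnit. The XML is a cut-down, made-up stand-in for the feed (the example.com URL is hypothetical), and I am only assuming that HtmlUnit's XPath evaluation treats the default namespace the same way. It compares a plain //item with a query that matches on the local name only:

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathConstants
import javax.xml.xpath.XPathFactory

// Hypothetical stand-in for the feed: same default namespace, a single item.
def xml = '''<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="http://example.com/1">
    <title>Example</title>
  </item>
</rdf:RDF>'''

// Parse with namespace awareness switched on (it is off by default in JAXP).
def factory = DocumentBuilderFactory.newInstance()
factory.namespaceAware = true
def doc = factory.newDocumentBuilder()
                 .parse(new ByteArrayInputStream(xml.getBytes('UTF-8')))

def xpath = XPathFactory.newInstance().newXPath()

// Helper: number of nodes an XPath expression selects in the document.
def countNodes = { String expr ->
    xpath.evaluate(expr, doc, XPathConstants.NODESET).length
}

// An unprefixed name in XPath 1.0 only matches elements in no namespace.
def plainCount = countNodes('//item')

// Matching on local-name() ignores the namespace entirely.
def localNameCount = countNodes("//*[local-name()='item']")

println "//item matched ${plainCount} node(s)"
println "local-name() matched ${localNameCount} node(s)"
```

In this sketch the plain //item query selects nothing, while the local-name() query finds the item, which is consistent with the default namespace being the cause.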
I have also tried this with Nokogiri (another XML parser, which I use for Ruby scripts).
With Nokogiri I can just use the XPath //xmlns:item, which works and returns all item nodes from the feed.
I have tried the same XPath with HtmlUnit but it does not work.
So I think I can phrase my question as: How can I select a node from the default namespace with HtmlUnit?
Any ideas?