
I want to build an RSS feed crawler for my website, but I'm not quite sure how to begin. How can my crawler identify an RSS feed? Is there anything I can crawl for that every RSS feed has? I don't need any code, just some help understanding what I have to create.

Thanks in advance!

Greetings

Xatenev

  • Check superfeedr.com if you don't feel like re-inventing the wheel :) Commented Apr 11, 2014 at 9:39
  • Hey, it seems very cool, but what can I do with it? :P It looks like a huge database of feeds where I can (possibly) get a lot of RSS feeds. Is that correct? Commented Apr 11, 2014 at 10:11

1 Answer


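I think it would be possible if your crawler scans all links and opens each page at least once to look for the text <rss version="2.0">. From what I understand, every RSS 2.0 feed should contain this line. Older feeds carry a different version number (0.91 or 0.92, for example), so matching on the opening <rss tag is safer than matching the exact string.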

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>
 <title>RSS Title</title>
 <description>This is an example of an RSS feed</description>
 <link>http://www.someexamplerssdomain.com/main.html</link>
 <lastBuildDate>Mon, 06 Sep 2010 00:01:00 +0000</lastBuildDate>
 <pubDate>Sun, 06 Sep 2009 16:20:00 +0000</pubDate>
 <ttl>1800</ttl>

 <item>
  <title>Example entry</title>
  <description>Here is some text containing an interesting description.</description>
  <link>http://www.wikipedia.org/</link>
  <guid>unique string per item</guid>
  <pubDate>Sun, 06 Sep 2009 16:20:00 +0000</pubDate>
 </item>

</channel>
</rss>
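
As a rough illustration of the check described above, here is a minimal Python sketch. Instead of searching for the literal string, it parses the downloaded text as XML and tests whether the root element is rss. The function name looks_like_rss is made up for this example, and the sketch assumes the page content has already been fetched.

```python
import xml.etree.ElementTree as ET


def looks_like_rss(document_text):
    """Return True if the text parses as XML whose root element is <rss>."""
    try:
        root = ET.fromstring(document_text)
    except ET.ParseError:
        # Not well-formed XML (most ordinary HTML pages end up here).
        return False
    # RSS 0.91, 0.92 and 2.0 all use <rss> as the root element.
    return root.tag == "rss"
```

Checking the root element rather than the exact string means a feed with different attribute order or spacing in the rss tag is still recognised.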

If you're going to use PHP, I have had very positive experiences with SimpleXML, which is built into PHP.
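
SimpleXML is PHP-specific. To show the same kind of parsing in a language-neutral way, here is a small Python sketch using the standard library's xml.etree.ElementTree. The function name read_feed and the shape of the returned dictionary are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET


def read_feed(feed_text):
    """Pull the channel title and a few fields of each item out of an RSS document."""
    root = ET.fromstring(feed_text)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("no <channel> element found; probably not an RSS feed")

    items = []
    for item in channel.findall("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # Dates in real feeds sometimes carry stray whitespace, so strip it.
            "pubDate": item.findtext("pubDate", default="").strip(),
        })
    return {"title": channel.findtext("title", default=""), "items": items}
```

Called on the example feed above, this would give back the channel title "RSS Title" and a one-element list describing "Example entry".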

P.S. Xatenev you're welcome ;)


5 Comments

And how can I actually crawl for those RSS feeds? How can my crawler identify them and give me back the data I need?
I don't know if you have much experience with regular expressions, but I think that's the way to go.
I know about regular expressions, but what I mean is this: a crawler, for example, just goes to a website, picks up all the links, and then continues crawling on the next website. How can I pick up all the RSS feeds on a website? Links are easy to find in the source code; can I find RSS feeds in the source code as well?
Could you clarify that some more? I think it would be possible if your crawler scans all links and opens each page at least once to look for the text "<rss version="2.0">". From what I understand, every RSS 2.0 feed should contain this line.
Ah, that's what I wanted to know, very cool, thank you! Thanks for the good explanation :).
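
The "pick up all links" step discussed in these comments could be sketched roughly as follows in Python. The class name LinkCollector and the helper collect_links are invented for this illustration. The sketch also reads link tags in addition to a tags, because many sites advertise their feed in the page head with a tag such as link rel="alternate" type="application/rss+xml". It does no fetching itself; each collected URL would still have to be downloaded and checked for an rss root element.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> and <link> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative links such as "/feed.xml" into absolute URLs.
            self.links.append(urljoin(self.base_url, href))


def collect_links(html_text, base_url):
    """Return all candidate URLs found in the page, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return collector.links
```

An HTML parser is used here instead of a regular expression because it copes better with attribute order, quoting styles, and line breaks inside tags.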
