4

Suppose I want to fetch a web page from within my application and parse it. How do I do that, and where should I start? Are any plugins or gems required? What is your usual approach to this kind of task?

2 Answers

7

You should try gems like Hpricot or Nokogiri.

Hpricot example:

require 'open-uri'
require 'rubygems'
require 'hpricot'

html = Hpricot(open(an_url).read)
# This would search for any images inside a paragraph (XPath)
html.search('/html/body//p//img')
# This would search for any images with the class "test" (CSS selector)
html.search('img.test')

Nokogiri example:

require 'open-uri'
require 'rubygems'
require 'nokogiri'

# On Ruby 3.0+, open-uri no longer patches Kernel#open, so call URI.open instead
html = Nokogiri::HTML(URI.open(an_url).read)
# This would search for any images inside a paragraph (XPath)
html.xpath('/html/body//p//img')
# This would search for any images with the class "test" (CSS selector)
html.css('img.test')

Nokogiri is generally faster. Both libraries feature a lot of functionality.

0

What you want to do is called "scraping".

Ryan Bates made two excellent screencasts on this topic.

I personally like Nokogiri more. You can also check out the following answer: Best Rails HTML Parser
