Parsing an HTML page with net/http

Question

In a previous question I found an answer with a hacky but working way to parse the title from a page using curl:

 # capture curl's output; the variable must match the one used in the match below
 curl = %x(curl http://google.com)
 simian = curl.match(/<title>(.*)<\/title>/)[1]
 puts simian

Now I would like to know whether there is a better way, using a Ruby standard library such as net/http to fetch the URL instead of curl.

Another issue is that if the page has non-standard characters in the title, the title is not parsed and curl.match cannot be completed. I have tried

 simian = s.encode('UTF-8')

and then

 simian = curl.match(/<title>(.*)<\/title>/)[1]

but it shows weird characters like 1#.

Thanks in advance for your help.

Sébastien Le Callonnec · Accepted Answer · 2012-09-07 20:42:27Z

Using Nokogiri is probably the simplest solution:

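require 'nokogiri'
require 'open-uri'

# URI.open comes from open-uri; plain Kernel#open no longer opens URLs in Ruby 3.0+
doc = Nokogiri::HTML(URI.open('http://www.google.com'))
# at_xpath returns the first match, or nil when the page has no <title>
elt = doc.at_xpath('//title')
puts elt.text unless elt.nil?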
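Since the question asks specifically about net/http, here is a minimal standard-library sketch of the fetch step, with no curl and no gems. The helper names fetch_html and extract_title and the redirect limit of 5 are assumptions made for illustration, not part of any library. The regex is only a fallback for when Nokogiri is not available; a real HTML parser is more robust.

```ruby
require 'net/http'
require 'uri'

# Fetch a page body with Net::HTTP, following up to `limit` redirects.
# fetch_html is a hypothetical helper name, not a standard-library method.
def fetch_html(url, limit = 5)
  raise ArgumentError, 'too many redirects' if limit.zero?

  response = Net::HTTP.get_response(URI(url))
  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # Resolve a possibly relative Location header against the current URL.
    fetch_html(URI.join(url, response['location']).to_s, limit - 1)
  else
    raise "request failed: #{response.code} #{response.message}"
  end
end

# Return the text of the first <title> element, or nil if there is none.
# The m flag lets the title span several lines; i also accepts <TITLE>.
def extract_title(html)
  match = html.match(%r{<title[^>]*>(.*?)</title>}mi)
  match && match[1].strip
end

# Usage (needs network access):
#   puts extract_title(fetch_html('http://www.google.com'))
```

The body returned by fetch_html can equally be handed to Nokogiri::HTML in place of the open-uri call above.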

edited Sep 7, 2012 at 20:42

answered Sep 7, 2012 at 20:29


3 Comments

devnull Over a year ago

Hi Sebastien, this is true and it works, but even Nokogiri fails in this case:

 doc = Nokogiri::HTML(open('zales.1.ai'))

which raises

 nokotest2.rb:5:in `<main>': undefined method `text' for nil:NilClass (NoMethodError)

Any help will be much appreciated :)

Sébastien Le Callonnec Over a year ago

If the page you are trying to access has no title, the XPath query matches nothing, so first returns nil; hence the error.

Sébastien Le Callonnec Over a year ago

I have edited to add a nil check, which is pretty much the best you can do if there is no title! ;)
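On the encoding problem from the question: one likely reason encode('UTF-8') appears to do nothing is that the string is already tagged as UTF-8 (or as binary) even though its bytes are in another charset, so there is nothing for Ruby to convert. A rough sketch of the usual workaround is to re-tag the bytes with their real charset first and transcode afterwards. The helper name to_utf8 and the ISO-8859-1 default are assumptions for illustration; in practice the charset would come from the Content-Type header or a meta tag.

```ruby
# Re-tag raw bytes with their actual charset, then transcode to UTF-8.
# to_utf8 and the 'ISO-8859-1' default are illustrative assumptions.
def to_utf8(raw, charset = 'ISO-8859-1')
  raw.dup
     .force_encoding(charset)                                 # say what the bytes really are
     .encode('UTF-8', invalid: :replace, undef: :replace)     # convert, replacing bad bytes
     .scrub                                                   # covers the case where charset is already UTF-8
end

# "é" stored as the single ISO-8859-1 byte 0xE9, as a server might send it
latin1_bytes = "<title>Caf\xE9</title>".b
title = to_utf8(latin1_bytes)[%r{<title>(.*?)</title>}m, 1]
puts title # prints "Café"
```

Nokogiri::HTML also accepts an encoding as its third argument, which may be simpler when the charset is known up front.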
