
In a previous question I found an answer with a hacky, but working, way to parse the title from a page:

 curl = %x(curl http://google.com)
 simian = curl.match(/<title>(.*)<\/title>/)[1]
 puts simian

Now I would like to know whether there is a better way, using a Ruby standard library such as net/http to fetch the URL instead of curl.

Another issue is that if the page has non-standard characters in the title, it isn't parsed and curl.match cannot complete. I have tried

 simian = s.encode('UTF-8') and then
 simian = curl.match(/<title>(.*)<\/title>/)[1]

but it shows weird characters like 1#. Thanks in advance for your help.

1 Answer


Using nokogiri is probably the simplest solution:

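require 'nokogiri'
require 'open-uri'

# URI.open comes from open-uri; a bare Kernel#open no longer accepts URLs on Ruby 3.0+
doc = Nokogiri::HTML(URI.open('http://www.google.com'))
# at_xpath returns the first matching node, or nil when the page has no <title>
elt = doc.at_xpath('//title')
puts elt.text unless elt.nil?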
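
If adding a gem is not an option, the standard library alone can probably cover the original question. The following is only a rough sketch: the helper names `fetch_html` and `extract_title` are made up for illustration, the redirect limit of 5 is arbitrary, and it assumes the page is meant to be UTF-8. Invalid bytes are replaced with `?` so that the regex match cannot raise on them. A regex is still a fragile way to read HTML, so treat this as a fallback rather than a replacement for a real parser.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: fetch a page body with Net::HTTP, following a few redirects.
def fetch_html(url, limit = 5)
  raise ArgumentError, 'too many redirects' if limit.zero?

  response = Net::HTTP.get_response(URI(url))
  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # Location may be relative, so resolve it against the current URL.
    fetch_html(URI.join(url, response['location']).to_s, limit - 1)
  else
    raise "request failed with status #{response.code}"
  end
end

# Hypothetical helper: return the text of the first <title> element, or nil.
def extract_title(html)
  # Assumption: the page is UTF-8. Net::HTTP returns binary strings, so
  # relabel the bytes as UTF-8 and replace any invalid sequence with '?'.
  text = html.dup.force_encoding('UTF-8').scrub('?')
  match = text.match(%r{<title[^>]*>(.*?)</title>}im)
  match && match[1].strip
end

# Usage (performs a network request, so it is left commented out):
# title = extract_title(fetch_html('http://www.google.com'))
# puts title unless title.nil?
```

Keeping the fetch and the parsing in separate methods means the title extraction can be tried on any string, without touching the network.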

3 Comments

Hi Sebastien, this is true and it works, but even Nokogiri fails in this case: doc = Nokogiri::HTML(open('zales.1.ai')) gives nokotest2.rb:5:in `<main>': undefined method `text' for nil:NilClass (NoMethodError). Any help will be much appreciated :)
If the page you are trying to access has no title, the XPath lookup finds no node and elt is nil, hence the error.
I have edited to add a nil check, which is pretty much the best you can do if there is no title! ;)
