-1

Is there a Ruby gem, or a Ruby-esque way, to check a webpage for broken links without crawling the actual links and checking for 404s, etc.? Basically, I want a solution that works offline, and I want to detect links that are obviously syntactically broken, not links that point to web pages that don't exist.

So for instance, if a link points to "http//stackoverflow.com", that's a syntactically broken link, and I want to detect that. However, if a link points to "http://www.webpagedoesnotexistyet.com" and it returns a 404, I'm OK with not detecting that.

5
  • 1
    Sounds like you want to use regex. I'd check out this post: stackoverflow.com/questions/4716513/… Commented Oct 30, 2013 at 16:50
  • What logic applies when relative links like /tags or /users are detected? Commented Oct 30, 2013 at 16:53
  • You'd still be using regex, first to find all a tags, then checking the href to make sure that it's either a valid full URL or begins with a / and doesn't include any invalid characters before the close quote. See this post: stackoverflow.com/questions/499345/… Commented Oct 30, 2013 at 17:25
  • Not a regex alone, but an XML parser to detect the href attribute for <a> tags, which then get passed through a regex. Commented Oct 30, 2013 at 17:31
  • 1
    And of course, the obligatory link to this post... Commented Oct 30, 2013 at 17:38

3 Answers

0

Use Nokogiri to parse the HTML and URI.parse to check each href for a valid URL. URI.parse raises URI::InvalidURIError when it encounters what it considers to be an invalid URL.


1 Comment

It will happily eat 'http//stackoverflow.com', though.
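As the comment notes, URI.parse treats 'http//stackoverflow.com' as a relative path, so a parse error alone does not catch it. A minimal sketch of a stricter offline check follows. The helper name broken_link? is hypothetical, and both the rule that http(s) links must carry a host and the heuristic for a missing colon are assumptions, not part of the URI library:

```ruby
require 'uri'

# Hypothetical helper: true when an href looks syntactically broken.
# Assumes href is a String. No network access is involved.
def broken_link?(href)
  text = href.strip
  uri = URI.parse(text)

  if uri.scheme.nil?
    # Relative reference such as /tags or page.html. Flag it only when it
    # looks like an absolute URL that lost its colon (assumed heuristic).
    return text.match?(/\A(?:https?|ftp)\W/i)
  end

  # Other schemes (mailto:, tel:, ...) are left alone in this sketch.
  return false unless %w[http https].include?(uri.scheme.downcase)

  # An http(s) link without a host, e.g. http:/example.com, is treated as broken.
  uri.host.nil? || uri.host.empty?
rescue URI::InvalidURIError
  true
end

broken_link?('http//stackoverflow.com')               # => true
broken_link?('http://www.webpagedoesnotexistyet.com') # => false
broken_link?('/tags')                                 # => false
```

The href values themselves would come from the parsed document, for example the href attribute of each a element that Nokogiri finds.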
0

Use this, where links is an array of link URLs:

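require 'net/http'
require 'uri'

# links is an array of URL strings.
links.each do |link|
  begin
    url = URI.parse(link)
    http = Net::HTTP.new(url.host, url.port)
    http.use_ssl = (url.scheme == 'https')
    # request_head rejects an empty path, so fall back to "/".
    path = url.path.empty? ? '/' : url.path
    res = http.request_head(path)

    if res.code == '200'
      puts "#{res.code} ok - #{link}"
    else
      puts "#{res.code} error - #{link}"
    end
  rescue StandardError => e
    # Malformed URLs and network failures both end up here.
    puts "breaking for #{link} (#{e.class})"
  end
end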


0

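You can use URI.regexp. If a string matches it, the string contains a syntactically valid URI (the pattern is not anchored, so a URI embedded in longer text also matches).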

require 'uri'

def valid_uri?(s)
  !!(s =~ URI.regexp)
end


valid_uri?('http//stackoverflow.com') # => false
valid_uri?('http://www.webpagedoesnotexistyet.com/') # => true
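Because URI.regexp is unanchored, a stricter variant can require the whole string to be a URI. The following is a sketch only: the constant HTTP_URI and the method whole_string_uri? are hypothetical names, and restricting the schemes to http and https is an assumption about which links matter here. It builds the pattern through URI::RFC2396_Parser#make_regexp, which newer Ruby versions recommend over URI.regexp:

```ruby
require 'uri'

# Anchored pattern limited to http and https (assumed scheme list).
HTTP_URI = /\A#{URI::RFC2396_Parser.new.make_regexp(%w[http https])}\z/

# Hypothetical helper: true only when the entire string is an http(s) URI.
def whole_string_uri?(s)
  s.match?(HTTP_URI)
end

whole_string_uri?('http//stackoverflow.com')                # => false
whole_string_uri?('http://www.webpagedoesnotexistyet.com/') # => true
whole_string_uri?('see http://stackoverflow.com')           # => false
```

This is still a purely syntactic check. It accepts some odd but grammatical forms, so it is best combined with a host check if stricter results are needed.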
