Extract all urls inside a string in Ruby

Question

I have some text content with a list of URLs contained in it.

I am trying to grab all the URLs out and put them in an array.

This is the code I have so far:

content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"

urls = content.scan(/^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$/ix)

I am trying to get the end results to be:

['http://www.google.com', 'http://www.google.com/index.html']

The above code does not seem to be working correctly. Does anyone know what I am doing wrong?

Thanks

balu · Accepted Answer · 2011-05-13 08:50:08Z

59

Easy:

require 'uri'

URI.extract(content, ['http', 'https'])
# => ["http://www.google.com", "http://www.google.com/index.html"]

edited May 13, 2011 at 8:50

answered May 9, 2011 at 16:42

balu


2 Comments

adeluccar Over a year ago

This should be marked as the answer. Far more elegant.

amit_saxena Over a year ago

This has problems extracting URLs from Markdown: it includes the closing parenthesis in the URL. For example, URI.extract("[link](https://www.example.com)") will return ["https://www.example.com)"].

FMc · Accepted Answer · 2010-02-19 16:37:01Z

6

A different approach, from the perfect-is-the-enemy-of-the-good school of thought:

urls = content.split(/\s+/).find_all { |u| u =~ /^https?:/ }
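
If the bare prefix check proves too loose, one possible refinement (a sketch, not part of the original answer) is to keep the whitespace split but let URI.parse from the standard library decide whether each token is an http or https URI with a host. The helper names http_url? and extract_http_urls are made up for this illustration.

```ruby
require 'uri'

# Hypothetical helper: true when the token parses as an http(s) URI with a host.
def http_url?(token)
  uri = URI.parse(token)
  # URI::HTTPS is a subclass of URI::HTTP, so this check covers both schemes.
  uri.is_a?(URI::HTTP) && !uri.host.to_s.empty?
rescue URI::InvalidURIError
  # Tokens that are not valid URIs at all are simply skipped.
  false
end

# Hypothetical helper: split on whitespace, keep only tokens that pass the check.
def extract_http_urls(text)
  text.split(/\s+/).select { |token| http_url?(token) }
end

content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"
urls = extract_http_urls(content)
# => ["http://www.google.com", "http://www.google.com/index.html"]
```

This still keeps trailing punctuation that is legal in a URI, such as a comma directly after a link, so it narrows the false positives rather than removing them.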

edited Feb 19, 2010 at 16:37

answered Feb 19, 2010 at 16:22

FMc


3 Comments

Chowlett Over a year ago

I'll give you simplicity. This may well be all that's needed.

Henley Wing Chiu Over a year ago

I graduated from that school!

sferik Over a year ago

This approach will miss many valid URLs and incorrectly select many invalid URLs.

Sam Saffron · Accepted Answer · 2012-05-07 06:07:36Z

I haven't checked the syntax of your regex, but String#scan returns an array in which each element is itself an array of the groups captured by your regex. So I'd expect the result to look like:

[['http', '.google.com'], ...]

You'll need non-capturing groups, /(?:stuff)/, if you want the format you've given.
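
To make the difference concrete, here is a small sketch using a deliberately simplified pattern (not the regex from the question) and made-up example hosts. It only illustrates how capture groups change what scan returns.

```ruby
text = "see http://a.example and https://b.example"

# With capturing groups, scan returns one array of groups per match.
with_groups = text.scan(/(https?):\/\/(\S+)/)
# => [["http", "a.example"], ["https", "b.example"]]

# With only non-capturing groups, scan returns the full matched strings.
without_groups = text.scan(/(?:https?):\/\/\S+/)
# => ["http://a.example", "https://b.example"]
```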

Edit (looking at regex): Also, your regex does look a bit wrong. You don't want the start and end anchors (^ and $), since you don't expect the matches to be at start and end of content. Secondly, if your ([0-9]{1,5})? is trying to capture a port number, I think you're missing a colon to separate the domain from the port.

Further edit, after playing: I think you want something like this:

content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html http://example.com:3000/foo"
urls = content.scan(/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix)
# => ["http://www.google.com", "http://www.google.com/index.html", "http://example.com:3000/foo"]

... but note that it won't match pure IP-address URLs (like http://127.0.0.1), because of the [a-z]{2,5} for the TLD.
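
If IP-address URLs matter, one option is to add a second alternative for the host. The following is only a sketch under that assumption: the constant name URL_WITH_IPV4 is invented here, the dotted-quad branch does not check that each octet is at most 255, and, unlike the regex above, the port is allowed even when no path follows.

```ruby
# Hypothetical variant: accepts either a host name ending in a TLD
# or a dotted-quad IPv4 address as the host part.
URL_WITH_IPV4 = %r{
  https?://
  (?:
    [a-z0-9]+(?:[\-.][a-z0-9]+)*\.[a-z]{2,5}   # host name ending in a TLD
    |
    \d{1,3}(?:\.\d{1,3}){3}                    # dotted-quad IPv4 address
  )
  (?::[0-9]{1,5})?                             # optional port
  (?:/[^\s]*)?                                 # optional path
}ix

text = "ping http://127.0.0.1:8080/status and http://www.google.com"
urls = text.scan(URL_WITH_IPV4)
# => ["http://127.0.0.1:8080/status", "http://www.google.com"]
```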

HaNdTriX · Accepted Answer · 2012-07-23 17:22:27Z

4

Just for your interest:

Ruby has a URI module, which ships with a regex for this kind of task:

require "uri"

# Schemes to look for
schemes_to_grab = ['ftp', 'http', 'https', 'mailto', 'see']
urls = []

# scan yields the capture groups to the block; $& holds the full match
html_string.scan(URI.regexp(schemes_to_grab)) do |*matches|
  urls << $&
end

For more information, see the Ruby reference for URI.

answered Jul 23, 2012 at 17:22

HaNdTriX


amit_saxena · Accepted Answer · 2022-03-23 13:39:54Z

The most upvoted answer was causing issues with Markdown URLs for me, so I had to figure out a regex to extract URLs. Below is what I use:

URL_REGEX = /(https?:\/\/\S+?)(?:[\s)]|$)/i
content.scan(URL_REGEX).flatten

The last part, (?:[\s)]|$), identifies the end of the URL, and you can add characters to it to suit your needs and content. As written, it stops at a whitespace character, a closing parenthesis, or the end of a line (in Ruby, $ matches at the end of each line, which includes the end of the string).

content = "link in text [link1](http://www.example.com/test) and [link2](http://www.example.com/test2)

http://www.example.com/test3

http://www.example.com/test4"

For this content, content.scan(URL_REGEX).flatten returns ["http://www.example.com/test", "http://www.example.com/test2", "http://www.example.com/test3", "http://www.example.com/test4"].
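
Adding a period to the terminator set would cut every URL at its first dot, so sentence punctuation needs a different treatment. One possible sketch, with the helper name extract_urls made up for illustration, is to match greedily up to the next whitespace and then strip trailing punctuation from each match. It assumes that URLs in your content never legitimately end in one of the stripped characters, and that consecutive links are separated by whitespace.

```ruby
# Hypothetical helper: grab everything up to the next whitespace,
# then drop trailing punctuation such as ")", "]", "," or ".".
def extract_urls(text)
  text.scan(/https?:\/\/\S+/i).map do |url|
    url.sub(/[)\].,;:!?'"]+\z/, '')
  end
end

content = "link in text [link1](http://www.example.com/test) and " \
          "http://www.example.com/test2, then http://www.example.com/test3."
urls = extract_urls(content)
# => ["http://www.example.com/test", "http://www.example.com/test2", "http://www.example.com/test3"]
```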
