Good afternoon,

I'm learning to use regular expressions in Ruby and have hit a point where I need some assistance. I am trying to extract zero or more URLs from a string.

This is the code I'm using:

sStrings = ["hello world: http://www.google.com", "There is only one url in this string http://yahoo.com . Did you get that?", "The first URL in this string is http://www.bing.com and the second is http://digg.com","This one is more complicated http://is.gd/12345 http://is.gd/4567?q=1", "This string contains no urls"]
sStrings.each  do |s|
  x = s.scan(/((http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.[\w-]*)?)/ix)
  x.each do |url|
    puts url
  end
end

This is what is returned:

http://www.google.com
http
.google
nil
nil
http://yahoo.com
http
nil
nil
nil
http://www.bing.com
http
.bing
nil
nil
http://digg.com
http
nil
nil
nil
http://is.gd/12345
http
nil
/12345
nil
http://is.gd/4567
http
nil
/4567
nil

What is the best way to extract only the full URLs and not the parts of the RegEx?

2 Answers

4

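You could use non-capturing groups, (?:...), instead of capturing groups, (...). When the pattern contains no capturing groups, scan returns only the full text of each match.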
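As a rough sketch of the difference, the snippet below runs scan twice on a made-up sample string (the variable names and the short first pattern are only for illustration). The second pattern is the regex from the question with every group switched to (?:...):

```ruby
text = "first http://www.bing.com then http://is.gd/12345"

# With capturing groups, scan returns one array per match,
# holding one element per group rather than the whole match.
with_groups = text.scan(/(https?):\/\/([\w.-]+)/i)
p with_groups

# With only non-capturing groups, scan returns the whole matched strings.
urls = text.scan(/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?:[0-9]{1,5})?\/.[\w-]*)?/i)
p urls
```

The first call prints nested arrays of group contents, while the second prints a flat array of URL strings, which is what the question is after.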

I see that you are doing this in order to learn Regex, but in case you really want to extract URLs from a String, take a look at URI.extract, which extracts URIs from a String. (require "uri" in order to use it)
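A minimal sketch of the URI.extract approach, using one of the strings from the question. Note that the documentation for recent Ruby versions describes URI.extract as obsolete (it warns when Ruby runs in verbose mode), so treat this as an illustration rather than a recommendation for new code:

```ruby
require "uri"

text = "The first URL in this string is http://www.bing.com and the second is http://digg.com"

# The second argument limits matches to the listed schemes.
urls = URI.extract(text, ["http", "https"])
urls.each { |url| puts url }
```

A string with no URLs simply yields an empty array, so the each block does nothing.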

1

You can create a non-capturing group using (?:SUB_PATTERN). Here's an illustration, with some additional simplifications thrown in. Also, since you're using the /x option, take advantage of it by laying out your regex in a readable way.

sStrings = [
    "hello world: http://www.google.com",
    "There is only one url in this string http://yahoo.com . Did you get that?",
    "... is http://www.bing.com and the second is http://digg.com",
    "This one is more complicated http://is.gd/12345 http://is.gd/4567?q=1",
    "This string contains no urls",
]

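sStrings.each do |s|
    # With no capturing groups, scan returns an array of whole-match strings.
    x = s.scan(/
        https?:\/\/             # scheme
        \w+                     # first host label
        (?: [.-]\w+ )*          # remaining host labels
        (?:
            \/                  # start of the path
            [0-9]{1,5}          # numeric path segment
            (?: \? [\w=]* )?    # optional query string
        )?
    /ix)

    p x
end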

This is fine for learning, but don't try to match URLs this way in real code. There are tools for that, such as Ruby's uri standard library.
