How to do advanced URL parsing with RegEx?

Question

I'm using the following method to parse URLs:

Regex.Replace(text, @"((www\.|(http|https|ftp)\://)[.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])",
                            "<a href=\"$1\" target=\"&#95;blank\">$1</a>", RegexOptions.IgnoreCase).Replace("href=\"www.", "href=\"http://www.");

It works great, but:

asdhttp://google.com will be parsed, so how can I disallow characters before "http://" / "www"?
When a URL is inside a tag, I don't want it to be parsed:

[url]http://google.com[/url]

How can I do that?

How about URLs inside IMG and LINK tags, are they allowed to match? Does "a tag" in your description mean an <a> tag?
– Vantomex, Commented Oct 14, 2010 at 10:57

Sachin Shanbhag · Accepted Answer · 2010-10-14 08:56:46Z

1

Use ^ before http and www, which means the string must start with www, http, https or ftp:

^(www\.|(http|https|ftp)



3 Comments

Alex Over a year ago

But then something like "google: http://google.com" won't work.

Sachin Shanbhag Over a year ago

@Alex: Do you have a specific set of strings that should or should not be allowed? If you try to include google, then you will have to include adshttp as well, or you have to hardcode google, like http|ftp|https|google.

Alex Over a year ago

I just have to parse URLs in a text, just like any forum does. "Hello, this is my website: http://as.com" - the URL should be parsed here. "Hihttp://as.com" - should not be parsed. So using ^ and $ is not a solution.
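Alex's objection can be checked with a small test. The sketch below uses a simplified stand-in for the URL pattern (an assumption, not the exact regex from the question) and wraps it in ^ and $. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorCheck
{
    // Simplified stand-in for the question's URL pattern (assumption, not the original regex).
    private const string Url =
        @"(www\.|(http|https|ftp)://)[a-z0-9.-]+\.[a-z0-9/_:@=.+?,#%&~-]*[a-z0-9/]";

    // True only when the whole input is a URL, because of the ^ and $ anchors.
    public static bool MatchesAnchored(string text) =>
        Regex.IsMatch(text, "^(" + Url + ")$", RegexOptions.IgnoreCase);

    public static void Main()
    {
        Console.WriteLine(MatchesAnchored("http://as.com"));                            // True
        Console.WriteLine(MatchesAnchored("Hello, this is my website: http://as.com")); // False
        Console.WriteLine(MatchesAnchored("Hihttp://as.com"));                          // False
    }
}
```

The anchored pattern rejects "Hihttp://as.com" as wanted, but it also rejects the URL embedded in a sentence, which is the case a forum needs to handle.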

red-X · Accepted Answer · 2010-10-14 08:57:47Z

1

Add ^ at the beginning and $ at the end, so that nothing can come before http or after the URL:

Regex.Replace(text, @"^((www\.|(http|https|ftp)\://)[.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])$",
                            "<a href=\"$1\" target=\"&#95;blank\">$1</a>", RegexOptions.IgnoreCase).Replace("href=\"www.", "href=\"http://www.");


Kobi · Accepted Answer · 2010-10-14 08:59:22Z

0

Since it seems the URL is part of a block of text, use \b for a word boundary:

Regex.Replace(text, @"\b((www\.| ... "

Your second question is a bit more tricky - have you considered using the same regex for both tasks?
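A minimal sketch of the \b idea, assuming a simplified stand-in for the URL pattern (not the exact regex from the question). The class and method names are hypothetical, and the www. prefix fix is done inside the replacement callback instead of a second string Replace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryLinker
{
    // Simplified stand-in URL pattern (assumption), prefixed with \b.
    private static readonly Regex UrlRegex = new Regex(
        @"\b((www\.|(http|https|ftp)://)[a-z0-9.-]+\.[a-z0-9/_:@=.+?,#%&~-]*[a-z0-9/])",
        RegexOptions.IgnoreCase);

    public static string Linkify(string text)
    {
        return UrlRegex.Replace(text, m =>
        {
            string url = m.Value;
            // Give scheme-less www links an http:// prefix in the href only.
            string href = url.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
                ? "http://" + url
                : url;
            return "<a href=\"" + href + "\" target=\"_blank\">" + url + "</a>";
        });
    }

    public static void Main()
    {
        Console.WriteLine(Linkify("Hihttp://as.com"));          // left unchanged
        Console.WriteLine(Linkify("see http://as.com now"));    // URL becomes a link
        Console.WriteLine(Linkify("[url]http://as.com[/url]")); // still linked: ']' then 'h' is a word boundary
    }
}
```

The last call shows why \b alone does not settle the second question: a boundary exists between "]" and "h", so a URL inside [url]...[/url] is still matched.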


3 Comments

Alex Over a year ago

Looks like that's what I need. But how can I exclude the word?

Kobi Over a year ago

@Alex - I gave it some thought, and it isn't so simple. You could put (?<!\[url\]) before the regex (a negative lookbehind), but that wouldn't work for [url]http://www.example.com[/url]: the match can no longer start at http://, but it can still start at www, so www.example.com is captured. As I've said, you may need to write a small parser for that, so you can parse these tokens first and let the regex handle the rest.

Alex Over a year ago

Ok, thanks. I'll try to find something about BB code parsers online.
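One way to approximate the "parse the tokens first" suggestion without a full BB code parser is a single alternation: the regex first tries to consume a whole [url]...[/url] block, and only otherwise tries a bare URL. This is a sketch under assumptions, not Kobi's implementation: the URL pattern is a simplified stand-in, the class and group names are hypothetical, and only the plain [url]...[/url] form is handled (not [url=...] or nested tags).

```csharp
using System;
using System.Text.RegularExpressions;

public static class BbAwareLinker
{
    // First alternative: a complete [url]...[/url] block, kept as-is.
    // Second alternative: a bare URL at a word boundary (simplified stand-in pattern).
    private static readonly Regex TokenRegex = new Regex(
        @"(?<tag>\[url\].*?\[/url\])" +
        @"|\b(?<url>(www\.|(http|https|ftp)://)[a-z0-9.-]+\.[a-z0-9/_:@=.+?,#%&~-]*[a-z0-9/])",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Linkify(string text)
    {
        return TokenRegex.Replace(text, m =>
        {
            // A BB code block was consumed whole, so return it untouched.
            if (m.Groups["tag"].Success)
            {
                return m.Value;
            }

            string url = m.Groups["url"].Value;
            string href = url.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
                ? "http://" + url
                : url;
            return "<a href=\"" + href + "\" target=\"_blank\">" + url + "</a>";
        });
    }

    public static void Main()
    {
        Console.WriteLine(Linkify("[url]http://google.com[/url] and http://as.com"));
        Console.WriteLine(Linkify("asdhttp://google.com"));
    }
}
```

Because the regex engine scans left to right and the tag alternative is tried first, the URL inside the block is already consumed by the time the bare-URL alternative could see it. A real BB code parser would still be the more robust choice for anything beyond this simple case.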
