0

I'm trying to detect all URLs in a block of free text. I'm using .NET's Regex.Matches call with the following regex: (http|https)://[^\s "']{4,}

Now, I've put in the following text:
here is a link http://somelink.com
here is a link that I didn't space withhttp://nospacelink.com/something?something=&39358235
http://nospacelink.com/something?something=&12233454
here is a link I already handled. Here is some secret t&cs you're not allowed to know https://somethingbad.com
Just to be a little annoying I've put in a new address thingy capture type of 'http://somethinginspeechmarks.com' and what are you going to do now?
here is a link http://postTextLink.com at then some post text
Here is a link with a full stop http://alinkwithafullstoplink.com. And then some more.

and I get the following output:

http://somelink.com
http://nospacelink.com?something=&39358235
http://nospacelink.com?something=&12233454
http://alreadyhandledlink.com
https://somethingbad.com
http://somethinginspeechmarks.com
http://postTextLink.com
http://alinkwithafullstoplink.com.

Please notice the full stop on the last entry. How can I update my regex to say "If there is a full stop at the end, please ignore it?"

Also, please note that "Getting parts of a URL (Regex)" has nothing to do with my question, as that question is about how to break down a particular URL. I want to extract multiple, complete urls. Please see my input and current outputs for clarification! I have got a regex already that does most of what I want, but isn't quite right. Could you please explain where my approach might be improved?

5 Comments
  • Loving that I can't just mark my own question as duplicate lol Commented May 16, 2014 at 13:51
  • Change to (http|https)://[^\s "']{4,}(?<!\.) - added (?<!\.) in the end. Commented May 16, 2014 at 14:05
  • @Kilazur, I meant that I could only vote it as a duplicate, as opposed to just closing it as a duplicate... Commented May 16, 2014 at 14:11
  • @smerny, could you provide an example of a url which wouldn't pass with (http|https)://[^\s "']{4,}[^\.\s"']+? Commented May 16, 2014 at 14:14
  • @ImmortalBlue, that regex isn't in the answer marked as duplicate Commented May 16, 2014 at 14:48
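To make the negative lookbehind suggested in the comments concrete, here is a minimal sketch. It is written in Python rather than C#, on the assumption that these particular constructs (negated character classes, `{4,}`, `(?<!\.)`) behave the same way in .NET's Regex.Matches. The function name `extract_urls` and the shortened sample text are made up for illustration.

```python
import re

# The question's pattern plus the (?<!\.) lookbehind from the comment above.
# The lookbehind fails while the match ends on a full stop, so the greedy
# {4,} part backtracks until the last matched character is not a ".".
URL_PATTERN = re.compile(r"""(http|https)://[^\s "']{4,}(?<!\.)""")


def extract_urls(text):
    """Return every full match (hypothetical helper, not from the thread)."""
    # finditer + group(0) gives the whole URL; findall would return only the
    # captured scheme because the pattern contains a capturing group.
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


# A few of the lines from the question's input.
sample = (
    "here is a link http://somelink.com\n"
    "here is a link that I didn't space with"
    "http://nospacelink.com/something?something=&39358235\n"
    "Here is a link with a full stop http://alinkwithafullstoplink.com. "
    "And then some more.\n"
)

for url in extract_urls(sample):
    print(url)
```

Dots inside the URL are untouched; only a match that would end on a full stop is shortened. Other trailing punctuation, such as a comma, would still be captured.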

2 Answers

1

I would add something like [^\.] to the end of the pattern.

This says that the last character can't be a full stop.

So (http|https)://[^\s "']{4,}[^\.] will try to match all addresses that don't end with a full stop.

Edit:

As pointed out in the comments, this character class should work better: [^.\s"']
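A quick way to see the difference between the two character classes in this answer is to run both against lines from the question. This is only a sketch in Python (not C#), assuming negated character classes behave the same in .NET; `FIRST_TRY`, `REVISED` and `extract` are illustrative names.

```python
import re

# First suggestion: any trailing character except a full stop.
FIRST_TRY = re.compile(r"""(http|https)://[^\s "']{4,}[^\.]""")
# Revised suggestion: the trailing character may not be a full stop,
# whitespace, or a quote mark either.
REVISED = re.compile(r"""(http|https)://[^\s "']{4,}[^.\s"']""")


def extract(pattern, text):
    """Return the full text of every match (hypothetical helper)."""
    return [m.group(0) for m in pattern.finditer(text)]


sentence = ("Here is a link with a full stop "
            "http://alinkwithafullstoplink.com. And then some more.")
quoted = "capture type of 'http://somethinginspeechmarks.com' and what now?"

# repr() makes a trailing space or quote mark visible in the output.
for line in (sentence, quoted):
    print(repr(extract(FIRST_TRY, line)), repr(extract(REVISED, line)))
```

With the first class, the trailing [^\.] happily consumes the space or the quote mark that follows the URL. The revised class forces the engine to backtrack onto the last real URL character instead. One side effect: the revised pattern needs at least five characters after ://, since the extra class always consumes one.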

3 Comments

This would actually match http://alinkwithafullstoplink.com. (with an extra space at the end) as well as http://somethinginspeechmarks.com' (the quote mark)
Exactly! Then [^\.\s"']+
User wants dots, just not at the end.
-1

Updated:

Consider the following minor change to your pattern:

(http|https)://[^\s "']{4,}(?=\.)

4 Comments

That stops at any ., so it gives output like http://somelink and http://nospacelink etc...
Fixed the pattern. Try that for size.
That still returns the . at the end... http://linkwithafullstop.com. http://alinkwithafullstoplink.com.
Made a minor change to the pattern... . to \.
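For anyone wondering why the lookahead version truncates most links, here is a small sketch of the (?=\.) pattern from this answer, written in Python rather than C# on the assumption that lookahead works the same way in .NET. The helper name `extract_with_lookahead` is made up for illustration.

```python
import re

# The answer's pattern: the match must be immediately followed by a full stop.
LOOKAHEAD = re.compile(r"""(http|https)://[^\s "']{4,}(?=\.)""")


def extract_with_lookahead(text):
    """Return the full text of every match (hypothetical helper)."""
    return [m.group(0) for m in LOOKAHEAD.finditer(text)]


lines = [
    "here is a link http://somelink.com",
    "here is a link http://postTextLink.com at then some post text",
    "Here is a link with a full stop http://alinkwithafullstoplink.com. And then some more.",
]

for line in lines:
    print(extract_with_lookahead(line))
```

Only the last line comes out as intended. When a URL is not followed by a full stop, the engine backtracks to the last position that is followed by one, which is the dot before "com", so the first two lines lose their ".com". A lookbehind that rejects a trailing dot, as in (?<!\.), avoids this because it puts no requirement on what comes after the match.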
