
I'm trying to extract one or more URLs from a plain text string in PHP. Here are some examples:

"mydomain.com has hit the headlines again"

extract "http://www.mydomain.com"

"this is 1 domain.com and this is anotherdomain.co.uk but sometimes http://thirddomain.net"

extract "http://www.domain.com", "http://www.anotherdomain.co.uk", "http://www.thirddomain.net"

There are two special cases I need to handle. I'm thinking of regular expressions, but I don't fully understand them:
1) all symbols like '(' or ')' and spaces (but not hyphens) need to be removed
2) the word "dot" needs to be replaced with the symbol ".", so "dot com" would become ".com"

P.S. I'm aware of PHP validation/regex for URLs, but I can't work out how to use it to achieve the end goal.

Thanks

    mydomain.com != http://www.mydomain.com Commented Nov 6, 2010 at 9:36

1 Answer


In this case it will be hard to get 100% correct results. Depending on the input, you could restrict matching to the most popular top-level domains (add more to the list as needed):

(?:https?://)?[a-zA-Z0-9\-\.]+\.(?:com|org|net|biz|edu|uk|ly|gov)\b

You may need to remove the word boundary (\b) to get different results.

You can test it here:

http://bit.ly/dlrgzQ
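As a rough sketch of how the pattern above could produce the output the question asks for, the following uses `preg_match_all` and then rebuilds each match as `http://www.<host>`. The function name `extract_urls` is hypothetical, and always prefixing `http://www.` is an assumption taken from the question's examples, not something the pattern itself requires.

```php
<?php
// Hypothetical helper: extract_urls() is an illustration, not part of the original answer.
function extract_urls(string $text): array
{
    // The pattern from the answer, wrapped in ~ delimiters so "/" needs no escaping.
    $pattern = '~(?:https?://)?[a-zA-Z0-9\-\.]+\.(?:com|org|net|biz|edu|uk|ly|gov)\b~';
    preg_match_all($pattern, $text, $matches);

    $urls = [];
    foreach ($matches[0] as $match) {
        // Strip any scheme and leading "www." so every result is rebuilt the same way.
        $host = preg_replace('~^(?:https?://)?(?:www\.)?~i', '', $match);
        // Assumption: the question wants every result in the form http://www.<host>.
        $urls[] = 'http://www.' . strtolower($host);
    }

    // Drop duplicates while keeping first-seen order.
    return array_values(array_unique($urls));
}

$text = 'this is 1 domain.com and this is anotherdomain.co.uk but sometimes http://thirddomain.net';
print_r(extract_urls($text));
// Contains: http://www.domain.com, http://www.anotherdomain.co.uk, http://www.thirddomain.net
```

Anything whose last label is not in the TLD list is silently skipped, so the list decides both the false positives and the misses.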

EDIT: About your cases: 1) Remove them from what? 2) This could be done in PHP like this:

 // Replace " dot " with "." only when a listed TLD follows; extend the list as needed.
 $result = preg_replace('/\s+dot\s+(?=(?:com|org|net|biz|edu)\b)/i', '.', $input);
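A slightly extended sketch of the same idea, in case "dot" also appears wrapped in brackets, as in "example (dot) com". That bracket handling is a guess at what the question's first special case means, and `normalize_dots` is a hypothetical name.

```php
<?php
// Hypothetical helper: normalize_dots() is an illustration, not part of the original answer.
function normalize_dots(string $input): string
{
    // Assumption: "dot" may be wrapped in ( ) or [ ], e.g. "example (dot) com".
    // The lookahead rewrites "dot" only when a listed TLD follows, so ordinary
    // uses of the word are left alone. Extend the TLD list as needed.
    $pattern = '/\s*[(\[]?\bdot\b[)\]]?\s*(?=(?:com|org|net|biz|edu)\b)/i';
    return preg_replace($pattern, '.', $input);
}

echo normalize_dots('example (dot) com is down'), "\n"; // example.com is down
echo normalize_dots('a polka dot dress'), "\n";         // a polka dot dress
```

Running this step before the URL-matching pattern lets the rewritten text be matched like any other domain.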

But I have a few important notes:

  • These regexes are meant as guidance, not as production code.
  • Working with loose rules like these on free text is wacky, to say the least, and adding more special cases will make it even loonier. Consider this: even Stack Overflow doesn't do that. It auto-links

http://example.org

but not

example.org

  • It would be easier if you said what you are trying to achieve. If you want to process text that will later be published somewhere on the web, then this is a very bad idea. You should not do this on your own (as you said, you don't understand regex), because it would be a can of XSS worms. Better to look at some kind of markup language such as Markdown or BBCode.

Also take a look at: http://htmlpurifier.org/
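To make the XSS concern concrete, here is a minimal sketch of escaping an extracted URL before it is written into HTML. The helper name `link_html` and the `rel="nofollow"` attribute are assumptions for illustration. Escaping alone does not validate the URL scheme, so this is not a substitute for a purifier library.

```php
<?php
// Hypothetical helper: link_html() is for illustration only.
function link_html(string $url): string
{
    // Encode &, <, > and both quote styles so the value cannot break out of
    // the attribute or inject markup into the link text.
    $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
    return '<a href="' . $safe . '" rel="nofollow">' . $safe . '</a>';
}

echo link_html('http://www.example.com/?a=1&b=2'), "\n";
```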


1 Comment

+1, but if you want to match international and special domains like amazon.de or apple.tv, you might want to add [a-z]{2} as an alternative top-level domain (and drop uk and ly from the list).
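A sketch of what that suggestion could look like, with `[a-z]{2}` standing in for the two-letter entries. The function name `find_domains` is hypothetical. One trade-off to keep in mind: any two-letter suffix now matches, so a file name such as script.js would be picked up as well.

```php
<?php
// Hypothetical helper illustrating the comment's suggestion.
function find_domains(string $text): array
{
    // Generic TLDs stay listed explicitly; [a-z]{2} covers two-letter TLDs
    // such as de, tv and uk, which is why uk and ly are no longer listed.
    $pattern = '~(?:https?://)?[a-zA-Z0-9\-\.]+\.(?:com|org|net|biz|edu|gov|[a-z]{2})\b~';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

print_r(find_domains('buy at amazon.de or apple.tv, or example.co.uk'));
// Contains: amazon.de, apple.tv, example.co.uk
```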
