Could someone please help me with a regular expression (I need it in PHP and in JS) that removes http:// (or https://) and www. from the beginning of a URL string and removes the trailing / if it's there?

For example:

  • http://www.google.com/ would be google.com
  • https://yahoo.com?page=1 would be yahoo.com?page=1
  • fancysite.com/articles/2012/ would be fancysite.com/articles/2012

Here's the code I'm using for the JS side:

row.page_href.replace(/^(https?|ftp):\/\//, '')

And here's the code I'm using for the PHP side:

$urlString = rtrim($urlString, '/');
$urlString = preg_replace('~^(?:https?://)?(?:www[.])?~i', '', $urlString);

As you can see, the JS regex currently only removes the scheme (http://, https:// or ftp://), and the PHP requires two steps to do everything.

  • Why don't you add the www to the JS regex? Or why don't you use the same in both cases? I don't think PHP requires you to trim a possible / from the end of the string... that's just how you chose to do it. Commented Dec 28, 2012 at 16:40
  • May I ask why? Is this just for anchor text? Commented Dec 28, 2012 at 16:40
  • The right regular expression will work in both JS and PHP. Commented Dec 28, 2012 at 16:41
  • It's a requirement for my project... Why are you questioning why I need something? And no, this isn't for anchor text at all. Commented Dec 28, 2012 at 16:42
  • But... what's the problem with ^(?:https?://)?(?:www[.])?? Looks fine to me, just use it in JS and PHP. Commented Dec 28, 2012 at 16:44
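
A minimal sketch of what the last comment suggests: the PHP pattern written as a JavaScript regex literal (the slashes have to be escaped there), with a second alternative added so the trailing slash goes in the same call. The function name stripUrl is hypothetical, and allowing ftp:// is carried over from the JS snippet in the question.

```javascript
// Hypothetical helper: strips a leading scheme and "www." plus one trailing slash.
function stripUrl(url) {
  // First alternative: optional scheme, then optional "www.", at the start.
  // Second alternative: a single "/" at the very end.
  // The g flag lets both alternatives be replaced in one call.
  return url.replace(/^(?:(?:https?|ftp):\/\/)?(?:www\.)?|\/$/gi, '');
}

console.log(stripUrl('http://www.google.com/'));
console.log(stripUrl('fancysite.com/articles/2012/'));
```

Nothing in the pattern is JS-specific, so the same expression should also work in a single preg_replace call on the PHP side, with ~ as the delimiter so the slashes need no escaping.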

2 Answers

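function cleanUrl($url)
{
  // parse_url() only reports a host when a scheme is present, so add one if it is missing
  $full = preg_match('~^[a-z][a-z0-9+.-]*://~i', $url) ? $url : 'http://'.$url;
  if (($d = parse_url($full)) !== false && isset($d['host'])) // parseable url with a host
  {
    return sprintf('%s%s%s',
      preg_replace('/^www\./i', '', $d['host']),       // ltrim() takes a character list, not a prefix
      isset($d['path']) ? rtrim($d['path'], '/') : '', // comma, not dot: strip trailing slashes
      !empty($d['query']) ? '?'.$d['query'] : '');
  }
  return $url;
}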

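I would take advantage of parse_url, which splits the URL into its components so it can be 'cleaned' piece by piece. Note that it only returns false for seriously malformed URLs, so it is a sanity check rather than full validation.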
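
For the JS side, the same parse-then-reassemble idea can be sketched with the built-in URL class. This is an illustration under assumptions, not part of the answer: the function name cleanUrl is reused only for symmetry, and defaulting to http:// when the input has no scheme is my choice so that inputs like fancysite.com/articles/2012/ can be parsed.

```javascript
// Hypothetical JS counterpart: parse the URL, then rebuild only the wanted parts.
function cleanUrl(url) {
  // new URL() needs a scheme, so assume http:// when none is given (an assumption).
  const withScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(url) ? url : 'http://' + url;
  let parsed;
  try {
    parsed = new URL(withScheme);
  } catch (e) {
    return url; // not parseable: hand the input back unchanged
  }
  const host = parsed.hostname.replace(/^www\./i, ''); // remove only a literal "www." prefix
  const path = parsed.pathname.replace(/\/+$/, '');    // drop trailing slash(es)
  return host + path + parsed.search;                  // search already includes the "?"
}

console.log(cleanUrl('https://yahoo.com?page=1'));
```

Be aware that the URL class normalizes its input (the host comes back lowercased), and that this sketch drops the port and the fragment, which may or may not be what you want.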


3 Comments

Uh-duh, why didn't I think of that? I always forget about that function for some reason. Use this, OP.
I was going with the regex because I assumed it was faster than parsing and trimming the URL. Am I mistaken in my assumption?
@RachelD: A regex carries more overhead than PHP's plain string functions, so for this job I consider it more than is necessary.
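
To make the "plain string functions" alternative from these comments concrete, here is a regex-free sketch in JavaScript. It is only an illustration of the approach, not a benchmark, and the function name stripUrlPlain is hypothetical; whether it is measurably faster than a regex would have to be tested.

```javascript
// Hypothetical regex-free variant using only string methods.
function stripUrlPlain(url) {
  let result = url;
  const lower = result.toLowerCase();
  // Remove a leading scheme, if present.
  for (const scheme of ['http://', 'https://']) {
    if (lower.startsWith(scheme)) {
      result = result.slice(scheme.length);
      break;
    }
  }
  // Remove a leading "www." (case-insensitive).
  if (result.toLowerCase().startsWith('www.')) {
    result = result.slice(4);
  }
  // Remove a single trailing slash.
  if (result.endsWith('/')) {
    result = result.slice(0, -1);
  }
  return result;
}

console.log(stripUrlPlain('http://www.google.com/'));
```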

#(https?(://))?(www.?)?(.*)#i

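Worked just fine for me. You could change the last (.*) to match the RFC standards of a URL. Note that the unescaped . in www.? matches any single character; write www\. if only a literal dot should match.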

Outputs:

david@david-desktop ~ $ php -a
Interactive shell

php > $str = preg_replace('#(https?(://))?(www.?)?(.*)#i', '$4', 'https://www.google.ca');
php > echo $str . PHP_EOL;
google.ca
php > $str = preg_replace('#(https?(://))?(www.?)?(.*)#i', '$4', 'https://google.ca');
php > echo $str . PHP_EOL;
google.ca
php > $str = preg_replace('#(https?(://))?(www.?)?(.*)#i', '$4', 'http://google.ca');
php > echo $str . PHP_EOL;
google.ca
php > 
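
To see how the unescaped . in www.? behaves in practice, the sketch below ports the pattern to JavaScript next to a variant with a start anchor and an escaped dot. Both function names are hypothetical, and the second pattern is my variation rather than something from this answer.

```javascript
// The pattern from this answer, ported as-is: the "." after "www" is unescaped.
function stripLoose(url) {
  return url.replace(/(https?(:\/\/))?(www.?)?(.*)/i, '$4');
}

// Variant (an assumption, not from the answer): anchored, with a literal dot.
function stripStrict(url) {
  return url.replace(/^(?:https?:\/\/)?(?:www\.)?(.*)$/i, '$1');
}

// Both agree on the inputs shown in the session above...
console.log(stripLoose('https://www.google.ca'), stripStrict('https://www.google.ca'));
// ...but differ when the host merely starts with "www".
console.log(stripLoose('http://wwwhosting.example'), stripStrict('http://wwwhosting.example'));
```

With the loose pattern the character after www is consumed whatever it is, so a host such as wwwhosting.example loses its first four characters instead of being left alone.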

1 Comment

Thank you. I had something very similar to this, but it wasn't doing what I wanted, so I thought I was on the wrong trail. I will play with this more.
