1

I need to extract only part of a URL with PHP, but I am struggling to set the point where the extraction should stop. I used a regex to extract the entire URL from a longer string like this:

$regex = '/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i';
preg_match_all($regex, $href, $matches);

The result is the following string:

http://www.cambridgeenglish.org/test-your-english/&amp;sa=U&amp;ei=a4rbU8agB-zY0QWS_IGYDw&amp;ved=0CFEQFjAL&amp;usg=AFQjCNGU4FMUPB2ZuVM45OoqQ39rJbfveg

Now I want to extract only this bit: http://www.cambridgeenglish.org/test-your-english/. I basically need to get rid of everything from &amp onwards.

Does anyone have an idea how to achieve this? Do I need to run another regex, or can I add it to the initial one?

1
  • I think I am going with Avinash Raj's solution below. Works well for me. Thank you for your comment! Commented Aug 1, 2014 at 13:02

2 Answers

5

I would suggest you abandon regex and let PHP's own parse_url function do this for you:

http://php.net/manual/en/function.parse-url.php

$parsed = parse_url($url);
// parse_url() stores the host name under the 'host' key (there is no 'hostname' key)
$my_url = $parsed['scheme'] . '://' . $parsed['host'] . $parsed['path'];

To get the substring of the path up to the &amp, try:

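$parsed = parse_url($url);
// strpos() returns false when '&amp' is absent, so fall back to the full path
$pos = strpos($parsed['path'], '&amp');
$my_url = $parsed['scheme'] . '://' . $parsed['host'] . ($pos === false ? $parsed['path'] : substr($parsed['path'], 0, $pos));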
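
The parse_url approach above could be wrapped in a small helper that also copes with malformed input. This is only a sketch under assumptions: the function name url_before_ampersand is hypothetical, it assumes PHP 7.1 or later, and it cuts at the first bare & so that both the encoded (&amp;) and decoded (&) forms are handled, which goes slightly beyond what the answer does.

```php
<?php
// Hypothetical helper (not part of the original answer): rebuild
// scheme://host/path and drop everything from the first '&' in the path.
function url_before_ampersand(string $url): ?string
{
    $parsed = parse_url($url);

    // parse_url() returns false on seriously malformed URLs, and the
    // 'scheme' and 'host' keys are only set when those parts are present.
    if ($parsed === false || !isset($parsed['scheme'], $parsed['host'])) {
        return null;
    }

    $path = $parsed['path'] ?? '';

    // strpos() returns false when there is no '&'; keep the path as-is then.
    $pos = strpos($path, '&');
    if ($pos !== false) {
        $path = substr($path, 0, $pos);
    }

    return $parsed['scheme'] . '://' . $parsed['host'] . $path;
}

// Shortened version of the URL from the question, used for illustration.
echo url_before_ampersand('http://www.cambridgeenglish.org/test-your-english/&amp;sa=U&amp;ved=0CFEQFjAL'), "\n";
```

Returning null rather than a partial string makes it easier for the caller to tell an unusable input apart from a URL that simply had nothing to strip.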

2 Comments

This is interesting, but the path will still contain the part after &amp, so it doesn't really solve my initial problem.
Updated the answer for you; I think that should do what you need.
2

The regex below gets rid of everything from the string &amp onwards. Your PHP code would be:

<?php
echo preg_replace('~&amp.*$~', '', 'http://www.cambridgeenglish.org/test-your-english/&amp;sa=U&amp;ei=a4rbU8agB-zY0QWS_IGYDw&amp;ved=0CFEQFjAL&amp;usg=AFQjCNGU4FMUPB2ZuVM45OoqQ39rJbfveg');
//=> http://www.cambridgeenglish.org/test-your-english/
?>

Explanation:

  • &amp Matches the literal string &amp.
  • .* Matches any character except a newline, zero or more times.
  • $ Matches the end of the string (the end of each line only when the m modifier is set).
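
To connect this with the question's code, the same pattern can be applied to every URL that preg_match_all() found, since preg_replace() also accepts an array as its subject. The following is a minimal sketch: the $href value is an invented sample string, and only the two patterns come from the question and this answer.

```php
<?php
// Pattern from the question: matches whole URLs inside a longer string.
$regex = '/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i';

// Hypothetical input, made up for illustration only.
$href = '<a href="/url?q=http://www.cambridgeenglish.org/test-your-english/&amp;sa=U&amp;ved=0CFEQFjAL">';

preg_match_all($regex, $href, $matches);

// $matches[0] holds the full matches; with an array subject,
// preg_replace() strips the '&amp...' tail from each element.
$clean = preg_replace('~&amp.*$~', '', $matches[0]);

print_r($clean);
```

With this sample input, $clean holds a single element, http://www.cambridgeenglish.org/test-your-english/. URLs that contain no &amp pass through unchanged.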
