0

I am trying to validate a YouTube URL using regex:

preg_match('~http://www\.youtube\.com/watch\?v=[a-zA-Z0-9-]+~', $videoLink)

It kind of works, but it also matches malformed URLs. For example, this matches, as expected:

http://www.youtube.com/watch?v=Zu4WXiPRek

But so will this:

http://www.youtube.com/watch?v=Zu4WX£&P!ek

And this won't:

http://www.youtube.com/watch?v=!Zu4WX£&P4ek

I think it's because of the + operator. It seems to be matching only the first character after v=, when it needs to match everything after v= against [a-zA-Z0-9-]. Any help is appreciated, thanks.

5
  • What you have looks fine. Are £, & and ! valid characters in the YouTube string? If so, add them to your [a-zA-Z0-9-] char class; otherwise, isn't it working as intended? Commented Sep 17, 2010 at 17:46
  • The +, by the way, means: match any of the characters in [a-zA-Z0-9-] one or more times, so it will keep going until it hits something not in there. Commented Sep 17, 2010 at 17:47
  • The only characters allowed in a YouTube video ID like this are a-z, A-Z, 0-9 and -. That's why I put [a-zA-Z0-9-]. It's not working as intended. I can submit URLs like: v=Zu4WX£&P!ek (in this case £, & and ! are illegal characters) and it matches them fine because it's only checking the first character after v=. Commented Sep 17, 2010 at 17:49
  • It would help if you provided more context. Where are you getting the URL from? Is it from a full page scrape, with the URLs in an href="..."? In that case you could do something like [a-zA-Z0-9-]+("|'). Or do you already have the list of URLs parsed, and you are looping through them? Commented Sep 17, 2010 at 17:55
  • The URL is being submitted through a form by the user, and I need to check that it is a valid YouTube URL before I send off requests to the page. Commented Sep 17, 2010 at 17:57

6 Answers

3

Here is an alternative that is longer and much less elegant than a regex, but it relies on PHP's native URL parsing functions, so it may be a bit more reliable in the long run:

 $url = "http://www.youtube.com/watch?v=Zu4WXiPRek";

 $query_string = parse_url($url, PHP_URL_QUERY); // "v=Zu4WXiPRek", or null if there is no query

 $query_string_parsed = array();
 parse_str((string) $query_string, $query_string_parsed); // an array with all GET params

 if (isset($query_string_parsed["v"])) {
     echo $query_string_parsed["v"]; // Will output Zu4WXiPRek, which you can then
                                     // validate against [a-zA-Z0-9-] using a regex
 }
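
As a minimal sketch of that last validation step, the extraction and the check could be combined as below. The function name extract_video_id is hypothetical, and allowing the underscore in the ID is an assumption that goes beyond the character class in the question.

```php
<?php
// Hypothetical helper: returns the video ID from a watch URL, or null if the
// "v" parameter is missing or contains characters outside [a-zA-Z0-9_-].
function extract_video_id(string $url): ?string
{
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null; // no query string, or the URL could not be parsed
    }

    $params = [];
    parse_str($query, $params);

    $id = $params['v'] ?? null;
    if (!is_string($id)) {
        return null; // "v" is missing, or was given as an array (v[]=...)
    }

    // \A and \z anchor the whole value, so a single illegal character rejects it
    return preg_match('/\A[a-zA-Z0-9_-]+\z/', $id) === 1 ? $id : null;
}
```

Because the query string is parsed first, the position of v among the parameters does not matter, so a URL with an extra parameter such as fmt=18 is handled the same way.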

3 Comments

Just want to point out that this is only really useful (and IMO recommended) if you already have just the URL, but not really if he's scraping a page for URLs.
That just seems like added code going back to the original problem. The problem is with validating the string after v=, which is what this code extracts. I don't need it extracted, I just need to make sure the rest of the URL after v= is matched by [a-zA-Z0-9-].
@Will yeah. This is a more standards-conformant way that can deal with changing URL structures to some extent. For example, it doesn't break when a URL has the popular &fmt=18 parameter. Anyway, it's just an alternative suggestion; as far as I can see, @lonesomeday answers your specific question.
0

The problem is that the pattern is not anchored and does not require any particular number of characters in the v= part of the URL, so preg_match succeeds as soon as some prefix of the string fits. So, for instance, checking

http://www.youtube.com/watch?v=Zu4WX£&P!ek

will match the prefix

http://www.youtube.com/watch?v=Zu4WX

and therefore return 1 (a match). You need to either specify the number of characters you need in the v= part:

preg_match('~http://www\.youtube\.com/watch\?v=[a-zA-Z0-9-]{10}~', $videoLink)

or specify that the group [a-zA-Z0-9-] must be the last part of the string:

preg_match('~http://www\.youtube\.com/watch\?v=[a-zA-Z0-9-]+$~', $videoLink)

Your other example

http://www.youtube.com/watch?v=!Zu4WX£&P4ek

does not match, because the character immediately after v= is !, and the + quantifier requires at least one character from [a-zA-Z0-9-] at that position.
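
The difference between the two behaviours can be sketched with the three URLs from the question. This is only an illustration: the variable names are made up here, and the anchored pattern adds ^ as well as $, which is slightly stricter than the pattern above.

```php
<?php
// Unanchored: succeeds as soon as some prefix of the subject fits the pattern.
$unanchored = '~http://www\.youtube\.com/watch\?v=[a-zA-Z0-9-]+~';
// Anchored: ^ and $ force the whole subject to fit the pattern.
$anchored = '~^http://www\.youtube\.com/watch\?v=[a-zA-Z0-9-]+$~';

$urls = [
    'http://www.youtube.com/watch?v=Zu4WXiPRek',
    'http://www.youtube.com/watch?v=Zu4WX£&P!ek',
    'http://www.youtube.com/watch?v=!Zu4WX£&P4ek',
];

// Record, for each URL, whether each pattern reports a match.
$results = [];
foreach ($urls as $url) {
    $results[$url] = [
        'unanchored' => preg_match($unanchored, $url) === 1,
        'anchored'   => preg_match($anchored, $url) === 1,
    ];
}
```

Only the second URL gets a different verdict from the two patterns: the unanchored one accepts it on the strength of the Zu4WX prefix, while the anchored one rejects it.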

2 Comments

I'm pretty sure the v= part varies, that's why I didn't use that before... and using [a-zA-Z0-9-]$ didn't work either. It's just returning false for everything.
That's because it should have been [a-zA-Z0-9-]+$; that was just a typo.
0

Short answer:

preg_match('%(http://www\.youtube\.com/watch\?v=(?:[a-zA-Z0-9-])+)(?:[&"\'\s]|$)%', $videoLink)

There are a few assumptions made here, so let me explain:

  • I added a capturing group ( ... ) around the entire http://www.youtube.com/watch?v=blah part of the link, so that we can say "I want to get the whole validated link up to and including the ?v=movieHash".
  • I added the non-capturing group (?: ... ) around your character set [a-zA-Z0-9-] and left the + sign outside of that. This will allow us to match all allowable characters up to a certain point.
  • Most importantly, you need to tell it how you expect your link to terminate. I'm taking a guess for you with (?:[&"\'\s]|$):

    • Will it be in HTML format (e.g. an anchor tag)? If so, the link in href will obviously end with a " or '.
    • Or maybe there's more to the query string, so there would be an & after the value of v.
    • Maybe there's a space or line break after the end of the link (\s).
    • Or the link is the whole string, so nothing follows it at all ($).

The important piece is that you can get much more accurate results if you know what's surrounding what you are searching for, as is the case with many regular expressions.

This non-capturing group (in which I'm making assumptions for you) will take a stab at finding and ignoring all the extra junk after what you care about (the ?v=awesomeMovieHash).

Results:

http://www.youtube.com/watch?v=Zu4WXiPRek
 - Group 1 contains http://www.youtube.com/watch?v=Zu4WXiPRek

http://www.youtube.com/watch?v=Zu4WX&a=b
 - Group 1 contains http://www.youtube.com/watch?v=Zu4WX

http://www.youtube.com/watch?v=!Zu4WX£&P4ek
 - No match

a href="http://www.youtube.com/watch?v=Zu4WX&size=large"
 - Group 1 contains http://www.youtube.com/watch?v=Zu4WX

http://www.youtube.com/watch?v=Zu4WX£&P!ek
 - No match
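
To read group 1 in PHP, pass a third argument to preg_match. The following is a rough sketch rather than a definitive version: the helper name validated_link is hypothetical, and the terminator group is assumed to accept end-of-string ($) as well, so that a bare URL with nothing after the ID still matches.

```php
<?php
// Assumption: the terminator group also accepts end-of-string ($), so a bare
// URL with nothing after the ID still matches.
$pattern = '%(http://www\.youtube\.com/watch\?v=[a-zA-Z0-9-]+)(?:[&"\'\s]|$)%';

// Hypothetical helper: returns the validated link (group 1), or null when the
// subject contains no link that ends in one of the expected terminators.
function validated_link(string $pattern, string $subject): ?string
{
    // preg_match() fills $matches: [0] is the full match, [1] is group 1
    return preg_match($pattern, $subject, $matches) === 1 ? $matches[1] : null;
}
```

The terminator character itself lands in $matches[0] but not in $matches[1], which is why the group boundaries matter here.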

0

The "v=..." blob is not guaranteed to be the first parameter in the query part of the URL. I'd recommend using PHP's parse_url() function to break the URL into its component parts. You can also reassemble a pristine URL if someone began the string with "https://" or simply used "youtube.com" instead of "www.youtube.com", etc.

function get_youtube_vidid ($url) {
    $vidid = false;
    $valid_schemes = array ('http', 'https');
    $valid_hosts = array ('www.youtube.com', 'youtube.com');
    $valid_paths = array ('/watch');

    $bits = parse_url ($url);
    if (! is_array ($bits)) {
        return false;
    }
    if (! (array_key_exists ('scheme', $bits)
            and array_key_exists ('host', $bits)
            and array_key_exists ('path', $bits)
            and array_key_exists ('query', $bits))) {
        return false;
    }
    if (! in_array ($bits['scheme'], $valid_schemes)) {
        return false;
    }
    if (! in_array ($bits['host'], $valid_hosts)) {
        return false;
    }
    if (! in_array ($bits['path'], $valid_paths)) {
        return false;
    }
    $querypairs = explode ('&', $bits['query']);
    if (count ($querypairs) < 1) {
        return false;
    }
    foreach ($querypairs as $querypair) {
        # Limit to 2 parts and pad, so a pair without '=' does not raise a notice
        list ($key, $value) = array_pad (explode ('=', $querypair, 2), 2, '');
        if ($key == 'v') {
            if (preg_match ('/^[a-zA-Z0-9\-_]+$/', $value)) {
                # Set the return value
                $vidid = $value;
            }
        }
    }

    return $vidid;
}
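
The "reassemble a pristine URL" step mentioned above might look like the sketch below. The name canonical_watch_url is hypothetical, and treating https://www.youtube.com/watch as the canonical form is an assumption, not something YouTube requires.

```php
<?php
// Hypothetical helper: builds one canonical watch URL from a video ID that has
// already been validated (for example by get_youtube_vidid() above).
function canonical_watch_url(string $vidid): string
{
    // http_build_query() URL-encodes the value, so the query part stays well-formed
    return 'https://www.youtube.com/watch?' . http_build_query(['v' => $vidid]);
}
```

Rebuilding the URL from the ID alone also drops any extra parameters the user pasted in, so every accepted link ends up in the same shape.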

0

The following regex is meant to match any YouTube link:

$pattern='@(((http(s)?://(www\.)?)|(www\.)|\s)(youtu\.be|youtube\.com)/(embed/|v/|watch(\?v=|\?.+&v=|/))?([a-zA-Z0-9._\/~#&=;%+?-\!]+))@si';

2 Comments

It doesn't work on youtube-nocookie.com URLs, nor does it work on URLs with a query string like ?v=0123456789a&q=18#t=12s.
Also, your character class contains an inverted range, ?-\!, which means it won't work with many regex flavors, including PHP preg.
-1

If you'd like to cover all YouTube URL variants, try this:

^(?:(?:https?:)?\/\/)?(?:(?:(?:www|m(?:usic)?)\.)?youtu(?:\.be|be\.com)\/(?:shorts\/|live\/|v\/|e(?:mbed)?\/|watch(?:\/|\?(?:\S+=\S+&)*v=)|oembed\?url=https?%3A\/\/(?:www|m(?:usic)?)\.youtube\.com\/watch\?(?:\S+=\S+&)*v%3D|attribution_link\?(?:\S+=\S+&)*u=(?:\/|%2F)watch(?:\?|%3F)v(?:=|%3D))?|www\.youtube-nocookie\.com\/embed\/)([\w-]{11})[\?&#]?\S*$

It's a RegExp from a related question that covers any known YouTube URL (including music.*, shorts/, live/, e/, embed/, v/, *-nocookie, etc.). It doesn't match these:

  (wrong ID)
youtube.com/watch?v=U$t-slLl30E
  (too short ID)
youtube.com/watch?v=U9t-slLl30&t=10
  (wrong or deprecated paths)
youtube.com/GitHub?v=U9t-slLl30E
youtube.com/?v=U9t-slLl30E
youtube.com/?vi=U9t-slLl30E
youtube.com/?feature=player_embedded&v=U9t-slLl30E
youtube.com/watch?vi=U9t-slLl30E
youtube.com/vi/U9t-slLl30E
  (www.youtube-nocookie.com/embed/ only!)
youtube-nocookie.com/embed/U9t-slLl30E
www.youtube-nocookie.com/watch?v=U9t-slLl30E
http://www.youtube-nocookie.com/v/U9t-slLl30E?version=3&hl=en_US&rel=0
  (playlist)
youtube.com/playlist?list=PLmXxqSJJq-yVWpRFGImHYZBQTuBGLjG4t

You can try it here: https://regex101.com/r/7upRfP/. It also captures the video ID in group 1.

If you want you can restrict the video ID further with Glenn's answer instead of ([\w-]{11}).
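
In PHP, the expression above could be used roughly as follows. This is a sketch under assumptions: the constant and function names are made up for illustration, and ~ is chosen as the delimiter only because it does not occur in the pattern.

```php
<?php
// The pattern is the one quoted above, unchanged; only the ~ delimiters are added.
const YOUTUBE_URL_PATTERN = '~^(?:(?:https?:)?\/\/)?(?:(?:(?:www|m(?:usic)?)\.)?youtu(?:\.be|be\.com)\/(?:shorts\/|live\/|v\/|e(?:mbed)?\/|watch(?:\/|\?(?:\S+=\S+&)*v=)|oembed\?url=https?%3A\/\/(?:www|m(?:usic)?)\.youtube\.com\/watch\?(?:\S+=\S+&)*v%3D|attribution_link\?(?:\S+=\S+&)*u=(?:\/|%2F)watch(?:\?|%3F)v(?:=|%3D))?|www\.youtube-nocookie\.com\/embed\/)([\w-]{11})[\?&#]?\S*$~';

// Hypothetical helper: returns the 11-character video ID (group 1), or null
// when the URL is not one of the forms the pattern accepts.
function youtube_video_id(string $url): ?string
{
    return preg_match(YOUTUBE_URL_PATTERN, $url, $matches) === 1 ? $matches[1] : null;
}
```

A null return covers both "not a YouTube URL" and "YouTube URL with a malformed ID", so a form handler can reject the submission in either case.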
