I'm looking to create a PHP script where a user provides a link to a webpage, and the script fetches that page and parses it based on its contents.

For example, if a user provides a YouTube link:

http://www.youtube.com/watch?v=xxxxxxxxxxx

Then, it will grab the basic information about that video (thumbnail, embed code?)

Or they might provide a Vimeo link:

 http://www.vimeo.com/xxxxxx

Or even if they were to provide any link, without a video attached, such as:

 http://www.google.com/

And it could grab just the page title or some meta content.

I'm thinking I'd have to use file_get_contents, but I'm not exactly sure how to use it in this context.

I'm not looking for someone to write the entire code, but perhaps provide me with some tools so that I can accomplish this.

Try to ask a more straightforward question, like "How do I get the thumbnails of a YouTube video using PHP?" It might make people more responsive. Commented Sep 5, 2009 at 20:17

4 Answers

You can use either cURL or the HTTP library. You send an HTTP request, and then use the library to read the information you need from the HTTP response.
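
As a rough illustration of the cURL route, here is a minimal sketch. The helper name `fetchPage`, the returned array keys, and the timeout and redirect limits are assumptions made for this example, not part of any library. It only accepts http/https URLs because the link comes from a user.

```php
<?php
// Hypothetical helper: fetch a page over HTTP(S) with the cURL extension.
// Returns ['status' => int, 'type' => string|null, 'body' => string],
// or null if the URL is not http/https or the transfer fails.
function fetchPage(string $url, int $timeoutSeconds = 10): ?array
{
    // The URL is user-supplied, so refuse anything that is not http/https.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    if ($scheme !== 'http' && $scheme !== 'https') {
        return null;
    }

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        return null; // network error, timeout, etc.
    }

    // The handle is released automatically when $ch goes out of scope.
    return [
        'status' => (int) curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'type'   => curl_getinfo($ch, CURLINFO_CONTENT_TYPE),
        'body'   => $body,
    ];
}
```

A caller would typically check that the result is not null and that `status` is 200 before parsing `body`.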

1 Comment

In addition, you can use regex to parse the information you want from those websites.

I know this question is quite old, but I'll answer just in case someone hits it looking for the same thing.

Use oEmbed (http://oembed.com/) for YouTube, Vimeo, WordPress, SlideShare, Hulu, Flickr and many other services. If the site is not in the list, or you want more precise control, you can use this:

http://simplehtmldom.sourceforge.net/

It's a sort of jQuery for PHP, meaning you can use HTML selectors to get portions of the code (e.g. all the images, the contents of a div, only the text (no HTML) contents of a node, etc.).

You could do something like this (could be done more elegantly but this is just an example):

require_once("simple_html_dom.php");

function getContent ($item, $contentLength) 
{
    $raw = "";
    $content = "";
    $html = null;
    $images = "";

    if (isset ($item->content) && $item->content != "")
    {
        $raw = $item->content;
        $html = str_get_html ($raw);            
        $content = str_replace("\n", "<BR /><BR />\n\n", trim($html->plaintext));

        try
        {
            foreach($html->find('img') as $image) {
                if ($image->width != "1") 
                {
                    // Don't include images smaller than 100px height
                    $include = false;
                    $height = $image->height;
                    if ($height != "" && $height >= 100)
                    {
                        $include = true;
                    }
                    /*else
                    {
                        list($width, $height, $type, $attr) = getimagesize($image->src);
                            if ($height != "" && $height >= 100)
                                $include = true;
                    }*/                 

                    if ($include == true)
                    {
                        $images = $images . '<div class="theImage"><a href="'.$image->src.'" title="'.$image->alt.'"><img src="'.$image->src.'" alt="'.$image->alt.'" class="postImage" border="0" /></a></div>';
                    }
                }
            }
        }
        catch (Exception $e) {
            // Do nothing
        }

        $images = '<div id="images">'.$images.'</div>';
    }
    else
    {
        $raw = $item->summary;
        $content = str_get_html ($raw)->plaintext;
    }

    return (substr($content, 0 , $contentLength) . (strlen ($content) > $contentLength ? "..." : "") . $images);
}
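
For the oEmbed route mentioned at the top of this answer, a minimal sketch could look like the following. The function names `oembedEndpointFor` and `fetchOembed` are made up for this example, and the two endpoint URLs are the ones YouTube and Vimeo publish as far as I know, so treat them as assumptions and check each provider's documentation.

```php
<?php
// Hypothetical helper: map a page URL to the provider's oEmbed endpoint.
// Returns null when the host is not one of the providers listed here.
function oembedEndpointFor(string $url): ?string
{
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));
    $host = preg_replace('/^www\./', '', $host);

    // Assumed endpoints; extend this list for other oEmbed providers.
    $endpoints = [
        'youtube.com' => 'https://www.youtube.com/oembed?format=json&url=',
        'youtu.be'    => 'https://www.youtube.com/oembed?format=json&url=',
        'vimeo.com'   => 'https://vimeo.com/api/oembed.json?url=',
    ];

    if (!isset($endpoints[$host])) {
        return null;
    }
    return $endpoints[$host] . rawurlencode($url);
}

// Hypothetical helper: fetch and decode the oEmbed JSON for a URL.
// Returns the decoded array, or null if there is no provider or the request fails.
function fetchOembed(string $url): ?array
{
    $endpoint = oembedEndpointFor($url);
    if ($endpoint === null) {
        return null;
    }
    $json = @file_get_contents($endpoint); // requires allow_url_fopen
    if ($json === false) {
        return null;
    }
    $data = json_decode($json, true);
    return is_array($data) ? $data : null;
}
```

For a video, the decoded response normally carries fields such as `title`, `thumbnail_url` and `html` (the embed code). When `fetchOembed()` returns null, you can fall back to parsing the page HTML as shown above.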


file_get_contents() would work in this case, assuming that you have allow_url_fopen enabled in your php.ini. What you would do is something like:

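$embedStrings = array();
$pageContent = @file_get_contents($url);
if ($pageContent !== false) {
    // Non-greedy and case-insensitive; the "s" flag lets "." match newlines
    preg_match_all('#<embed.*?</embed>#is', $pageContent, $matches);
    $embedStrings = $matches[0];
}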

That said, file_get_contents() won't give you much in the way of error handling other than returning the content on success or false on failure. If you would like finer control over the request and access to the HTTP response codes, use the curl functions, in particular curl_getinfo, to look at the response code, MIME type, encoding, etc. Once you have the content via either curl or file_get_contents(), your code for parsing it to look for the HTML of interest will be the same.
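
For the generic case from the question (page title and meta content), the parsing step could be sketched with PHP's built-in DOMDocument instead of a regex. This is a swapped-in technique, not what the snippet above does, and the function name `extractPageInfo` and its return shape are assumptions for this example.

```php
<?php
// Hypothetical helper: pull the <title> and <meta> values out of an HTML string.
// Returns ['title' => string|null, 'meta' => [lowercased name/property => content]].
function extractPageInfo(string $html): array
{
    $info = ['title' => null, 'meta' => []];
    if (trim($html) === '') {
        return $info; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $info['title'] = trim($titles->item(0)->textContent);
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // Plain meta tags use "name"; Open Graph tags use "property".
        $key = $meta->getAttribute('name') ?: $meta->getAttribute('property');
        if ($key !== '') {
            $info['meta'][strtolower($key)] = $meta->getAttribute('content');
        }
    }

    return $info;
}
```

The same function works whether `$html` came from curl or from file_get_contents(). Pages that declare an `og:image` meta tag give you a thumbnail candidate without any site-specific code.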

1 Comment

After a call to file_get_contents using the HTTP wrapper (that is, opening a URL), the variable $http_response_header will be populated with the response headers.
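
A small sketch of how that variable could be used to recover the status code; the helper name `statusFromHeaders` is an assumption for this example.

```php
<?php
// Hypothetical helper: extract the final HTTP status code from a list of
// raw header lines such as the ones found in $http_response_header.
function statusFromHeaders(array $headers): ?int
{
    $status = null;
    foreach ($headers as $line) {
        // When redirects are followed there are several status lines;
        // keep overwriting so the last one wins.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Usage (performs a network request, so it is left commented out here):
// $body   = @file_get_contents('http://www.example.com/');
// $status = statusFromHeaders($http_response_header ?? []);
```

Note that `$http_response_header` is created in the scope where file_get_contents() was called, so read it in the same function.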

Maybe Thumbshots or Snap already have some of the functionality you want?

I know that's not exactly what you are looking for, but at least for the embedded stuff that might be handy. Also, txwikinger already answered your other question. But maybe that helps you anyway.
