
I've been trying to write a simple PHP script to pull data from an ISBN database site, and for some reason I've had nothing but issues using file_get_contents. I've managed to get something working now, but I'd like to know if anyone can tell why this wasn't working.

The code below would not populate $page with any data, so the preg_match calls that follow failed to extract anything. If anyone knows what the hell was stopping this, that would be great.

$links = array ('
    http://www.isbndb.com/book/2009_cfa_exam_level_2_schweser_practice_exams_volume_2','
    http://www.isbndb.com/book/uniform_investment_adviser_law_exam_series_65','
    http://www.isbndb.com/book/waterworks_a02','
    http://www.isbndb.com/book/winning_the_toughest_customer_the_essential_guide_to_selling','
    http://www.isbndb.com/book/yale_daily_news_guide_to_fellowships_and_grants'

    ); // array of URLs

foreach ($links as $link)
{
    $page = file_get_contents($link);
    #print $page;

    preg_match("@<h1 itemprop='name'>(.*?)</h1>@is",$page,$title);
    preg_match("@<a itemprop='publisher' href='http://isbndb.com/publisher/(.*?)'>(.*?)</a>@is",$page,$publisher);
    preg_match("@<span>ISBN10: <span itemprop='isbn'>(.*?)</span>@is",$page,$isbn10);
    preg_match("@<span>ISBN13: <span itemprop='isbn'>(.*?)</span>@is",$page,$isbn13);
    echo '<tr>
        <td>'.$title[1].'</td>
        <td>'.$publisher[2].'</td>
        <td>'.$isbn10[1].'</td>
        <td>'.$isbn13[1].'</td>
        </tr>';
    #exit();
}
  • There's a newline before each of your URLs; could that be causing the issue? Commented Sep 12, 2014 at 13:48
  • Never parse HTML with regex: stackoverflow.com/questions/1732348/… Commented Sep 12, 2014 at 13:50
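
As an illustration of the alternative that the second comment hints at, here is a minimal sketch that reads itemprop values with DOMDocument and DOMXPath instead of regular expressions. The helper name firstItemprop and the sample markup are hypothetical: the markup is modelled on the patterns in the question, not taken from the real site.

```php
<?php
// Hypothetical helper: returns the trimmed text of the first element
// carrying the given itemprop attribute, or null if none is found.
function firstItemprop(DOMXPath $xpath, string $name): ?string
{
    $nodes = $xpath->query("//*[@itemprop='" . $name . "']");
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Made-up markup shaped like the question's regex patterns (an assumption).
$html = "<html><body><h1 itemprop='name'>Example Title</h1>"
      . "<a itemprop='publisher' href='http://isbndb.com/publisher/example'>Example Press</a>"
      . "</body></html>";

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$title = firstItemprop($xpath, 'name');
$publisher = firstItemprop($xpath, 'publisher');
```

In a real script, $html would be the fetched page, and a null result signals that the element was missing, rather than producing an undefined-index notice as $title[1] would.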

1 Answer


My guess is that your URLs are wrong (not direct). The proper ones omit the www. part: if you request any of them and inspect the returned headers, you'll see that you're redirected (HTTP 301) to another URL.
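
One way to inspect the headers is PHP's get_headers(), which returns the response header lines as an array. The sketch below is only an illustration: the helper firstRedirectTarget is hypothetical, and the sample headers are made up to show the shape of the data rather than captured from the site.

```php
<?php
// Hypothetical helper: given the array returned by get_headers($url),
// return the value of the first Location header, or null if there is none.
function firstRedirectTarget(array $headers): ?string
{
    foreach ($headers as $header) {
        if (stripos($header, 'Location:') === 0) {
            return trim(substr($header, strlen('Location:')));
        }
    }
    return null;
}

// Against a live URL (needs network access, so left commented out):
// $headers = get_headers('http://www.isbndb.com/book/waterworks_a02');
// if ($headers !== false) {
//     echo $headers[0], "\n";                    // status line of the first response
//     var_dump(firstRedirectTarget($headers));
// }

// Made-up headers (an assumption) illustrating a 301 response.
$sample = [
    'HTTP/1.1 301 Moved Permanently',
    'Location: http://isbndb.com/book/waterworks_a02',
    'Content-Type: text/html',
];
$target = firstRedirectTarget($sample);
```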

The best way to handle this, in my opinion, is to use cURL and set the CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS options with curl_setopt.

Of course, you should trim your URLs beforehand, just to be sure that's not the problem.

Example here:

$curl = curl_init();
foreach ($links as $link) {

   curl_setopt($curl, CURLOPT_URL, trim($link)); // trim stray whitespace/newlines from the URL
   curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
   curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
   curl_setopt($curl, CURLOPT_MAXREDIRS, 5); // max 5 redirects

   $result = curl_exec($curl);
   if (! $result) {
      continue; // if $result is empty or false - ignore and continue;
   }

   // do what you need to do here
}
curl_close($curl);
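
For the trimming step, a minimal sketch, assuming the array is laid out as in the question (each quoted string begins with a newline and some spaces). It shortens the list to two of the question's URLs for brevity.

```php
<?php
// Same layout as in the question: every string starts with "\n" plus indentation.
$links = array('
    http://www.isbndb.com/book/waterworks_a02','
    http://www.isbndb.com/book/yale_daily_news_guide_to_fellowships_and_grants'
);

// trim() strips leading and trailing whitespace, including newlines,
// so each entry becomes a clean URL before it is requested.
$links = array_map('trim', $links);
```

After this, $links can be passed to either file_get_contents or the cURL loop without the stray whitespace.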