2

I'm trying to get content of this page: http://www.nytimes.com/2014/01/26/us/politics/rand-pauls-mixed-inheritance.html?hp&_r=0

I tried both file_get_contents and cURL, but each of them returns the NYTimes login page instead of the article, and I have no idea why.

I have already tried the suggestions in these questions: "file_get_contents()/curl getting unexpected page", "PHP file_get_contents() behaves differently to browser", and "file_get_content get the wrong web".

Is there any solution? Thanks

EDIT:

    //this is the curl code I use
    $cookieJar = dirname(__FILE__) . '/cookie.txt';
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar);
    curl_setopt($ch, CURLOPT_URL, $link);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12');
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data    = curl_exec($ch);
    curl_close($ch);
6
  • On the server you are running this code on, does "curl nytimes.com/2014/01/26/us/politics/…" output the right information? Commented Jan 29, 2014 at 19:15
  • They could be blocking access by domain (to prevent scraping) in their server settings such as .htaccess Commented Jan 29, 2014 at 19:15
  • Did you pass an agent? Commented Jan 29, 2014 at 19:15
  • nytimes is definitely blocking scrapers. You'll have to tinker with the cURL flags to get it to appear as if it's a browser. I'm not a cURL pro; I wish I could help more. Best of luck :) Commented Jan 29, 2014 at 19:21
  • @enigma I passed curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12'); Commented Jan 29, 2014 at 19:23

3 Answers

3

First, try saving the cookies in the same directory as the script, i.e. set the cookie path like this:
$cookie = "cookie.txt";
This code works for me, and I got the page:

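<?php
// Fetch a URL with cURL, keeping cookies between requests and following redirects.
function curl_get_contents($url)
{
  $cookie = "cookie.txt"; // relative path, resolved against the current working directory
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);   // return the body instead of printing it
  curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie); // read cookies from this file
  curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);  // write cookies to this file on close
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);   // follow HTTP redirects
  $data = curl_exec($ch);
  curl_close($ch);
  return $data;
}
$get_page = curl_get_contents("http://www.nytimes.com/2014/01/26/us/politics/rand-pauls-mixed-inheritance.html?hp&_r=1");
echo $get_page;
?>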
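
If the page you get back is still not the one you expect, it can help to look at what cURL actually did. This is only a minimal sketch and not part of the code above: the function name `curl_get_with_info` and the keys of the returned array are names I made up for illustration. It relies only on `curl_exec`, `curl_error`, and `curl_getinfo`.

```php
<?php
// Hypothetical helper (not from the original answer): fetch a URL and
// report what cURL did, so that redirects and failures become visible.
function curl_get_with_info($url, $cookieFile = "cookie.txt")
{
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
  curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

  $body = curl_exec($ch); // false when the transfer failed
  $result = array(
    'body'      => $body,
    'error'     => curl_error($ch), // empty string when there was no error
    'http_code' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
    'final_url' => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL), // URL after redirects
  );
  curl_close($ch);
  return $result;
}
```

Comparing `final_url` with the URL you requested shows whether you were redirected somewhere else (for example to a login page), and `error` tells you whether the transfer itself failed.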

5 Comments

Thanks, this works! The guy below you was faster, but your answer is better: it's more complex, and I got it working only because of it.
Glad to hear that. It's not complex; I just wrapped it in a function to make it reusable.
Yes, but it has the full cURL settings, so I could copy and paste it to see that it works. In my old cURL settings I had CONNECTTIMEOUT set, which made it malfunction.
You can add any further settings you want to this function.
I know, it was just that timeout that prevented it from working. I didn't realize that until you posted your answer.
1

I think you need cURL to allow cookies to be saved. Try adding these lines to the cURL setup. For me this worked:

$cookie = dirname(__FILE__) . "/cookie.txt";
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);

1 Comment

ok my bad, had CURLOPT_CONNECTTIMEOUT set... It works. Thanks.
0

Use the Live HTTP Headers Firefox plugin to check what happens while the browser loads the page. There may be redirects, cookies being set, and so on. Then try to reproduce that behaviour with PHP cURL (note: set the User-Agent and the other client headers to the same values the browser sends).
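
As a rough sketch of that idea, the options could be collected in one array and applied with `curl_setopt_array`. The function name `browser_like_curl_options` is made up for this example, and the `Accept` and `Accept-Language` values are placeholders, not headers captured from the NYTimes site; replace them with whatever your browser really sends.

```php
<?php
// Hypothetical helper: build a cURL option array that imitates a browser.
// The header values below are placeholders; copy the real ones from your browser.
function browser_like_curl_options($url, $userAgent, $cookieFile = "cookie.txt")
{
  return array(
    CURLOPT_URL            => $url,
    CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
    CURLOPT_FOLLOWLOCATION => true,        // follow redirects like a browser does
    CURLOPT_COOKIEFILE     => $cookieFile, // send cookies stored in this file
    CURLOPT_COOKIEJAR      => $cookieFile, // store received cookies in this file
    CURLOPT_USERAGENT      => $userAgent,
    CURLOPT_HTTPHEADER     => array(
      'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language: en-US,en;q=0.5',
    ),
  );
}

// Usage (performs a network request, so it is left commented out here):
// $ch = curl_init();
// curl_setopt_array($ch, browser_like_curl_options($link, 'your browser user agent string'));
// $data = curl_exec($ch);
// curl_close($ch);
```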
