curl returns 404 on valid page

Question

I've got a PHP function that checks a URL to make sure that (a.) there's some kind of server response, and (b.) it's not a 404.

It works just fine on every domain/URL I've tested, with the exception of bostonglobe.com, where it's returning a 404 for valid URLs. I'm guessing it has something to do with their paywall, but my function works fine on nytimes.com and other newspaper sites.

Here's an example URL that returns a 404:

https://www.bostonglobe.com/news/politics/2016/11/17/tom-brady-was-hoping-get-into-politics-might-time/j2X1onOLYc4ff2LpmM5k9I/story.html

What am I doing wrong?

function check_url($url){
  $userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)';
  $curl = curl_init($url);
  curl_setopt($curl, CURLOPT_NOBODY, true);
  curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
  curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
  curl_setopt($curl, CURLOPT_COOKIEJAR, 'cookies.txt');
  curl_setopt($curl, CURLOPT_COOKIEFILE, 'cookies.txt');
  $result = curl_exec($curl);
  if ($result === false) {
    // There was no response
    $message = "No information found for that URL";
  } else {
    // What was the response?
    $statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    if ($statusCode == 404) {
      $message = "No information found for that URL";
    } else {
      $message = "Good";
    }
  }
  return $message;
}

roberto06 · Accepted Answer · 2016-11-18 13:49:53Z

The problem seems to come from your CURLOPT_NOBODY option.

I've tested your code both with and without this line: the HTTP status code is 404 when CURLOPT_NOBODY is present, and 200 when it's not.

The PHP manual says that setting the CURLOPT_NOBODY option changes the request method to HEAD. My guess is that the server hosting bostonglobe.com doesn't support that method for this URL.
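
If you want to keep the cheap HEAD check but not trust a failure from it, one option is to retry with a normal GET whenever HEAD comes back with an error status. This is only a minimal sketch, not a drop-in replacement: the helper names fetch_status, should_retry_with_get and check_url_with_fallback are made up for illustration, the timeout is an arbitrary choice, and the GET retry downloads the full page body.

```php
<?php
// Hypothetical helper: returns the HTTP status code for $url,
// or 0 if the transfer itself failed (DNS error, timeout, ...).
function fetch_status(string $url, bool $headOnly): int
{
    $curl = curl_init($url);
    if ($headOnly) {
        curl_setopt($curl, CURLOPT_NOBODY, true);   // switches the method to HEAD
    } else {
        curl_setopt($curl, CURLOPT_HTTPGET, true);  // plain GET request
    }
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // do not echo the body
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($curl, CURLOPT_TIMEOUT, 10);          // assumed limit, in seconds
    $ok = curl_exec($curl);
    if ($ok === false) {
        return 0;
    }
    return (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);
}

// Hypothetical rule: only an HTTP error status (4xx/5xx) is worth retrying
// with GET. A status of 0 means there was no response at all.
function should_retry_with_get(int $headStatus): bool
{
    return $headStatus >= 400;
}

// Try HEAD first, then fall back to GET if the server rejected the HEAD.
function check_url_with_fallback(string $url): string
{
    $status = fetch_status($url, true);
    if (should_retry_with_get($status)) {
        $status = fetch_status($url, false);
    }
    if ($status === 0 || $status === 404) {
        return "No information found for that URL";
    }
    return "Good";
}
```

Calling check_url_with_fallback($url) returns the same two messages as the original check_url. The fallback costs a second request only for URLs whose HEAD response is an error.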

2 Comments

Dave Over a year ago

Ugh...dumb mistake. Thanks for solving the mystery, Roberto!

Raymond Nijland Over a year ago

bostonglobe.com does support HTTP HEAD requests. I also tested the code with CURLOPT_NOBODY and it works fine on my localhost server, so it looks like a firewall issue.

Sourav Ghosh · Accepted Answer · 2016-11-18 21:04:01Z

I checked this URL with the curl command:

curl --head https://www.bostonglobe.com/news/politics/2016/11/17/tom-brady-was-hoping-get-into-politics-might-time/j2X1onOLYc4ff2LpmM5k9I/story.html

It returned an error (HTTP/1.1 404 Not Found).

I also tried another command, wget. The result was the same.

wget --server-response --spider https://www.bostonglobe.com/news/politics/2016/11/17/tom-brady-was-hoping-get-into-politics-might-time/j2X1onOLYc4ff2LpmM5k9I/story.html

I also checked this case with a web service (an HTTP request generator: http://web-sniffer.net/). The result was the same.

Other URLs on https://www.bostonglobe.com/ work with HEAD requests, but I think the article pages (the ones ending in .html) do not support HEAD requests.

Perhaps the server administrator or a programmer has disabled HEAD requests for those pages?

In PHP, for example, that could look like this:

if ($_SERVER["REQUEST_METHOD"] === "HEAD") {
    // Respond with a 404, or use header() to redirect instead
    http_response_code(404);
    exit;
}

Or the server software (Apache and others) may be configured to restrict which HTTP request methods are allowed.

One reason to do this would be to reduce server load.
