I've got a PHP function that checks a URL to make sure that (a.) there's some kind of server response, and (b.) it's not a 404.

It works just fine on every domain/URL I've tested, with the exception of bostonglobe.com, where it's returning a 404 for valid URLs. I'm guessing it has something to do with their paywall, but my function works fine on nytimes.com and other newspaper sites.

Here's an example URL that returns a 404:

https://www.bostonglobe.com/news/politics/2016/11/17/tom-brady-was-hoping-get-into-politics-might-time/j2X1onOLYc4ff2LpmM5k9I/story.html

What am I doing wrong?

function check_url($url){
  $userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)';
  $curl = curl_init($url);
  curl_setopt($curl, CURLOPT_NOBODY, true);
  curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
  curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
  curl_setopt($curl, CURLOPT_COOKIEJAR, 'cookies.txt');
  curl_setopt($curl, CURLOPT_COOKIEFILE, 'cookies.txt');
  $result = curl_exec($curl);
  if ($result === false) {
    // There was no response
    $message = "No information found for that URL";
  } else {
    // What was the response?
    $statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    if ($statusCode == 404) {
      $message = "No information found for that URL";
    } else {
      $message = "Good";
    }
  }
  return $message;
}

2 Answers

The problem seems to come from your CURLOPT_NOBODY option.

I tested your code both with and without this line: the HTTP status code is 404 when CURLOPT_NOBODY is present, and 200 when it's not.

The PHP manual says that setting the CURLOPT_NOBODY option changes the request method to HEAD. My guess is that the server hosting bostonglobe.com doesn't support that method.
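
If you still want the cheaper HEAD request for most sites, one option is to try HEAD first and fall back to GET when HEAD fails. The following is only a minimal sketch of that idea, not a drop-in replacement: `fetch_status()` and the optional `$fetch` parameter are hypothetical names introduced here, the 10-second timeout is an arbitrary assumption, and the GET fallback downloads the whole page body.

```php
<?php
// Hypothetical helper: perform one request and return the HTTP status code,
// or 0 if the server gave no response at all.
function fetch_status(string $url, bool $headOnly): int
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY, $headOnly);     // true => HEAD, false => GET
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);  // keep the body out of the output
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($curl, CURLOPT_TIMEOUT, 10);           // assumed timeout, in seconds
    $ok = curl_exec($curl);
    return ($ok === false) ? 0 : (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);
}

// Try HEAD first; if it fails or returns an error status, retry with GET.
// $fetch exists only so the request step can be swapped out (e.g. in tests).
function check_url(string $url, ?callable $fetch = null): string
{
    $fetch = $fetch ?? 'fetch_status';
    $status = $fetch($url, true);
    if ($status === 0 || $status >= 400) {
        $status = $fetch($url, false);
    }
    if ($status === 0 || $status === 404) {
        return "No information found for that URL";
    }
    return "Good";
}
```

With this approach, a URL that answers 404 to HEAD but 200 to GET is reported as "Good", at the cost of a second request for such URLs.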


2 Comments

Ugh...dumb mistake. Thanks for solving the mystery, Roberto!
bostonglobe.com does support HTTP HEAD requests. I also tested the code with CURLOPT_NOBODY and it works fine on my localhost server, but it looks like a firewall issue.

I checked this URL with the curl command:

curl --head https://www.bostonglobe.com/news/politics/2016/11/17/tom-brady-was-hoping-get-into-politics-might-time/j2X1onOLYc4ff2LpmM5k9I/story.html

It returned an error (HTTP/1.1 404 Not Found).

I also tried the same check with wget, and the result was the same:

wget --server-response --spider https://www.bostonglobe.com/news/politics/2016/11/17/tom-brady-was-hoping-get-into-politics-might-time/j2X1onOLYc4ff2LpmM5k9I/story.html

I also checked this case with a web service (an HTTP request generator: http://web-sniffer.net/). The result was the same.

Other URLs on https://www.bostonglobe.com/ do respond to HEAD requests, but I think the article pages (with the .html extension) do not support HEAD requests.

Perhaps the server administrator or a programmer disabled HEAD requests for those pages?

In PHP, for example, that could look like this:

if ($_SERVER["REQUEST_METHOD"] == "HEAD") {
    // Respond with a 404 (or use header() to redirect instead)
    http_response_code(404);
    exit;
}

Or the server software (Apache and others) may be configured to limit which HTTP request methods are allowed.

One possible purpose of this is to reduce server load.
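
To check from PHP whether a given URL treats HEAD differently from GET, something like the sketch below could be used. It only mirrors the command-line checks above: `status_for()` and `compare_methods()` are hypothetical names introduced here, and the 10-second timeout is an arbitrary assumption.

```php
<?php
// Hypothetical helper: return the HTTP status code for $url using the given
// method ('HEAD' or 'GET'), or 0 if there was no response.
function status_for(string $url, string $method): int
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY, $method === 'HEAD'); // HEAD when true, GET otherwise
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);       // keep the body out of the output
    curl_setopt($curl, CURLOPT_TIMEOUT, 10);                // assumed timeout, in seconds
    $ok = curl_exec($curl);
    return ($ok === false) ? 0 : (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);
}

// Describe how the HEAD and GET status codes for the same URL relate.
function compare_methods(int $head, int $get): string
{
    if ($head === $get) {
        return "HEAD and GET agree ($head)";
    }
    if ($get >= 200 && $get < 400 && $head >= 400) {
        return "HEAD is rejected ($head) while GET succeeds ($get)";
    }
    return "HEAD ($head) and GET ($get) differ";
}
```

Usage would be along the lines of `echo compare_methods(status_for($url, 'HEAD'), status_for($url, 'GET'));` for each URL you want to compare.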
