
Final Update: It turned out the targeted website had blocked DO (DigitalOcean) IPs, which was causing the problems I had been chasing for days. I spun up an EC2 instance and got the code working there, together with caching etc. to reduce the hits on the website and let my users share it.

-

UPDATE: I managed to get the HTML by turning the cURL fail-on-error option off; however, besides returning the 405 error, the website is also not setting some cookies that are required for the page content to load.

curl_setopt($ch, CURLOPT_FAILONERROR, FALSE);

I'm using the following AJAX → PHP code to retrieve og: meta tags for websites. It works seamlessly for the majority of sites, but one or two specific sites return errors and the info cannot be retrieved. The errors are:

Warning: DOMDocument::loadHTML(): Empty string supplied as input in /my/home/path/getUrlMeta.php on line 58

From curl_error in my error_log

The requested URL returned error: 405 Not Allowed

And

Failed to connect to www.something.com port 443: Connection refused

I have no problem getting the HTML of the website when I use curl from my server's console, and no problem retrieving the information needed for the majority of websites with the code below.

function file_get_contents_curl($url)
{
    $ch = curl_init();
    $header[0] = "Accept: text/html, text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: no-cache";
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    //curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');

    curl_setopt($ch, CURLOPT_FAILONERROR, true); // see the update above: set to false to keep the body on a 405
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_USERAGENT,"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 " );
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    //The 2 SSL_VERIFY lines above work with sites like www.nytimes.com

    //Update: Added option for cookie jar since some websites recommended it. cookies.txt is set to permission 777. Still doesn't work.
    $cookiefile = '/home/my/folder/cookies.txt';
    curl_setopt( $ch, CURLOPT_COOKIESESSION, true );
    curl_setopt( $ch, CURLOPT_COOKIEJAR,  $cookiefile );
    curl_setopt( $ch, CURLOPT_COOKIEFILE, $cookiefile );

    $data = curl_exec($ch);

    if (curl_error($ch)) {
        error_log(curl_error($ch));
    }
    curl_close($ch);

    return $data;
}

$html = file_get_contents_curl($url);

libxml_use_internal_errors(true); // Yeah if you are so worried about using @ with warnings
$doc = new DomDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = '//*/meta[starts-with(@property, \'og:\')]';
$metas = $xpath->query($query);
$rmetas = array();
foreach ($metas as $meta) {
    $property = substr($meta->getAttribute('property'),3);
    $content = $meta->getAttribute('content');
    $rmetas[$property] = $content;
}

/*The code below falls back to the first image wider than 600px should og:image be empty.*/
if (empty($rmetas['image'])) {
    //$src = $xpath->evaluate("string(//img/@src)");
    //echo "src=" . $src . "\n";
    $query = '//*/img';
    $srcs = $xpath->query($query);
    foreach ($srcs as $src) {

        $property = $src->getAttribute('src');


        if (substr($property,0,4) == 'http' && in_array(substr($property,-3), array('jpg','png','peg'), true)) {
            if (list($width, $height) = getimagesize($property)) {
                // Use the first image wider than 600px and stop scanning.
                if ($width > 600) {
                    $rmetas['image'] = $property;
                    break;
                }
            }
        }

    }
}

echo json_encode($rmetas);


die();
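Side note: the "Empty string supplied as input" warning above comes from curl_exec() returning false (or an empty body) on failure. A minimal guard before the parsing step, sketched here with an illustrative error payload, avoids it:

$html = file_get_contents_curl($url);
if ($html === false || trim($html) === '') {
    // Nothing usable came back from cURL; skip DOMDocument entirely.
    echo json_encode(array('error' => 'empty response'));
    die();
}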

UPDATE: The connection-refused error was my own mistake (the website is not HTTPS-enabled), so I still have the 405 Not Allowed error.

curl_getinfo() output:

{
    "url": "http://www.example.com/",
    "content_type": null,
    "http_code": 405,
    "header_size": 0,
    "request_size": 458,
    "filetime": -1,
    "ssl_verify_result": 0,
    "redirect_count": 0,
    "total_time": 0.326782,
    "namelookup_time": 0.004364,
    "connect_time": 0.007725,
    "pretransfer_time": 0.007867,
    "size_upload": 0,
    "size_download": 0,
    "speed_download": 0,
    "speed_upload": 0,
    "download_content_length": -1,
    "upload_content_length": -1,
    "starttransfer_time": 0.326634,
    "redirect_time": 0,
    "redirect_url": "",
    "primary_ip": "SOME IP",
    "certinfo": [],
    "primary_port": 80,
    "local_ip": "SOME IP",
    "local_port": 52966
}

Update: If I do a curl -i from the console I get the following response: a 405 error, but followed by all the HTML that I need.

Home> curl -i http://www.domain.com
HTTP/1.1 405 Not Allowed
Server: nginx
Date: Wed, 22 Feb 2017 17:57:03 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Vary: Accept-Encoding
Vary: Accept-Encoding
Set-Cookie: PHPSESSID2=ko67tfga36gpvrkk0rtqga4g94; path=/; domain=.domain.com
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: __PAGE_REFERRER=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; domain=www.domain.com
Set-Cookie: __PAGE_SITE_REFERRER=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; domain=www.domain.com
X-Repository: legacy
X-App-Server: production-web23:8018
X-App-Server: distil2-kvm:80
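Since the 405 response above still sends Set-Cookie headers, one workaround worth trying (an untested sketch, not a confirmed fix) is a two-pass fetch on the same handle: the first request is allowed to fail with the 405 but its cookies stay attached to the handle, and the second request replays them. Whether that is enough to get past whatever crawler detection the site uses is another question.

$cookiefile = '/home/my/folder/cookies.txt';      // same jar as in the question
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_FAILONERROR, false);     // keep the body even on a 405
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookiefile);
curl_exec($ch);                                   // pass 1: collect the cookies
$html = curl_exec($ch);                           // pass 2: retry with the cookies attached
curl_close($ch);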
  • If it only stops working on some sites, this is a server-side problem. Nothing we can do to help. Commented Feb 22, 2017 at 18:07
  • @miken32 but the URL is accessible from a web browser. Doesn't curl emulate a browser? It's a publicly accessible website that requires no login, no SSL, etc. Commented Feb 22, 2017 at 18:08
  • Remove the CURLOPT_FAILONERROR and you'll get the full contents for the 405 just like the command line equivalent you show. Commented Feb 23, 2017 at 7:57
  • Hi Daniel, I did that a few minutes before you posted your comment. However, I do not know how the website managed to detect me as a crawler when I've sent all those headers. The HTML returned when FAILONERROR is false does not contain any real content of the website. Apparently with the 405 error the visitor cookie is not set, so the webpage does not display the content. Commented Feb 23, 2017 at 8:00

3 Answers


Since I was looking for a solution myself and no answer was given in the comments: in my case, the problem was:

     curl_setopt($ch, CURLOPT_NOBODY, 1);

Simply that: it sends a HEAD request, which might not be recognized or supported by the server, and therefore you get a 405.


1 Comment

This helped me with a problem checking Instagram URLs. For most sites, setting CURLOPT_NOBODY to true is helpful as it saves time when checking lots of links. But for Instagram you get a 405. Instead, I check with CURLOPT_NOBODY first and, if I get a 405, I check again with CURLOPT_NOBODY set to false.
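A rough sketch of that fallback (the function name and return logic are illustrative, not part of the comment above):

function url_is_reachable($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);        // cheap HEAD request first
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($code == 405) {                            // server rejects HEAD
        curl_setopt($ch, CURLOPT_NOBODY, false);
        curl_setopt($ch, CURLOPT_HTTPGET, true);   // switch back to a normal GET
        curl_exec($ch);
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    }

    curl_close($ch);
    return $code >= 200 && $code < 400;
}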

Add the following to your code to help debug the issue:

$info = curl_getinfo($ch);
print_r( $info );

More than likely, the issues are as follows:

  • 405 Not Allowed - the cURL call you are trying to make is not allowed, e.g. making a GET request when only POST is permitted.
  • 443: Connection refused - the site you are trying to access does not support HTTPS, or it uses cryptographic protocols your code does not support, e.g. the site accepts only TLSv1.2 while your code uses TLSv1.1. A small check that separates the two cases is sketched below.
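A small sketch that separates the two cases (the TLS line only matters if the target really requires TLSv1.2 and your cURL build supports it):

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSLVERSION, CURL_SSLVERSION_TLSv1_2); // force TLS 1.2 if a protocol mismatch is suspected
$data = curl_exec($ch);
$info = curl_getinfo($ch);

if ($data === false && curl_errno($ch) == CURLE_COULDNT_CONNECT) {
    error_log('Connection refused: check whether the site actually speaks HTTPS');
} elseif ($info['http_code'] == 405) {
    error_log('405 Not Allowed: the request method is being rejected by the server');
}
curl_close($ch);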

9 Comments

I've added the curl_getinfo output to my question. The website is a publicly accessible site, and I'm trying to get the og tags when users share the website URL in my application (think Facebook URL sharing).
It turns out the website doesn't use HTTPS, so I do not need to fix the connection-refused error, but I still can't get the 405 error resolved.
Have you tried to access the URL causing the 405 using a browser? Are GET requests allowed for this URL?
That is what has me scratching my head. Accessing the website from a browser is fine, and curl from the command line on the server works without any error. I can also share the website on Facebook. ALSO, I've tried the same code from multiple servers in different locations and they all return 405.
I've also modified the curl options to enable cookies and a cookie jar, but still to no avail.

Since these solutions didn't work for me, I'll post my solution here:

I added this line and stopped receiving the 405 error. It's all about 'GET' requests.

curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');

