
I'm writing a crawler in PHP that fetches a page's HTML and stores it in a variable. The code works fine if the site doesn't redirect. If I crawl Google, for example, I get the following:

CURL Result

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="http://www.google.com.br/?gfe_rd=cr&amp;ei=A14yVviJCuyp8wfmyIfIBg">here
</A>.
</BODY></HTML>

PHP method

private function parseHTML($url){
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_HTTPHEADER, array('X-Apple-Tz: 0', 'X-Apple-Store-Front: 143444,12'));
    ob_start();
    curl_exec($curl); 
    curl_close($curl);
    $html = ob_get_contents();
    ob_end_clean();
    return $html;
}

How can I follow the redirect to the destination page, crawl its HTML, and return it?

  • When you get that 302 page content, is the HTTP status header also set to 302? Commented Oct 29, 2015 at 19:47

1 Answer

If the server redirects your request, setting the CURLOPT_FOLLOWLOCATION option will do the trick, possibly in conjunction with the CURLOPT_MAXREDIRS option to limit the number of redirects. See the documentation for PHP's curl_setopt function.

For example:

curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_MAXREDIRS, 5);
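Put together, the two options might slot into the question's method roughly as follows. This is only a sketch: `redirectOptions` and `fetchHtml` are hypothetical names, and CURLOPT_RETURNTRANSFER is swapped in for the output-buffering approach in the question so that curl_exec returns the body directly.

```php
<?php
// Hypothetical helper: builds the option set for a redirect-following GET.
function redirectOptions(string $url, int $maxRedirects = 5): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,          // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,          // follow Location headers sent with 3xx responses
        CURLOPT_MAXREDIRS      => $maxRedirects, // give up after this many redirects
    ];
}

// Hypothetical stand-in for parseHTML(): returns the final page's HTML, or null on failure.
function fetchHtml(string $url): ?string
{
    $curl = curl_init();
    curl_setopt_array($curl, redirectOptions($url));
    $html = curl_exec($curl);
    // The handle is released when $curl goes out of scope.
    return $html === false ? null : $html;
}
```

Calling `fetchHtml('http://www.google.com')` should then return the HTML of the page the redirect chain ends on, assuming the server sends a real 3xx status with a Location header.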

However, in the example provided, the server does not appear to be redirecting your cURL request; instead it returns a page telling the user where the document has moved. If that is the case, I'm afraid your application has to read and parse the content and perform the redirection itself.
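If the body really is all you get, one way to "read and digest" it is to pull the link out of the returned HTML. The sketch below assumes the target is the first anchor in the page, as in the response shown in the question; `extractRedirectTarget` is a hypothetical name, and it relies on PHP's DOM extension.

```php
<?php
// Hypothetical helper: returns the href of the first <a> element in $html, or null if none.
function extractRedirectTarget(string $html): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            return $href; // entities such as &amp; are already decoded here
        }
    }
    return null;
}
```

The returned URL could then be passed back into the crawler's fetch method. A cap on how many times this is repeated is worth adding, for the same reason CURLOPT_MAXREDIRS exists.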


2 Comments

Nothing rules out a 302 header being sent along with that content, in which case the OP could use the cURL options that you rightly suggest. They would need to look at the response headers to see whether they are truly getting a 302. It is very common for a web server to serve custom error content along with an appropriate response header. You especially see this for 404 responses.
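You are right, @MikeBrant, thanks for the input. In that case, we could also check the status code with curl_getinfo() and CURLINFO_HTTP_CODE to identify whether it is a 302. (CURLOPT_POSTREDIR only controls whether a POST stays a POST across a redirect.)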
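To check what the comments describe, a request can be made without following redirects and the status inspected afterwards. This is a rough sketch with hypothetical helper names (`isRedirectStatus`, `inspectResponse`); it uses curl_getinfo with CURLINFO_HTTP_CODE and CURLINFO_REDIRECT_URL.

```php
<?php
// Hypothetical helper: true for the 3xx codes that normally carry a Location header.
function isRedirectStatus(int $status): bool
{
    return in_array($status, [301, 302, 303, 307, 308], true);
}

// Hypothetical helper: fetches $url without following redirects and reports what came back.
function inspectResponse(string $url): array
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, false);
    $body = curl_exec($curl);

    $status = (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);
    // The URL cURL would have gone to next had redirects been followed, if any.
    $target = curl_getinfo($curl, CURLINFO_REDIRECT_URL);

    return [
        'status'     => $status,
        'isRedirect' => isRedirectStatus($status),
        'target'     => $target ?: null,
        'body'       => $body === false ? null : $body,
    ];
}
```

If `isRedirect` comes back true for the Google URL, the 302 page in the question was sent with a genuine redirect status, and CURLOPT_FOLLOWLOCATION should be enough on its own.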
