2

I'm trying to scrape this link: https://www.bu.edu/link/bin/uiscgi_studentlink/1293403322?College=SMG&Dept=AC&Course=222&Section=C1&Subject=ACCT &MtgDay=&MtgTime=&ModuleName=univschr.pl&KeySem=20114&ViewSem=Spring+2011&SearchOptionCd=C&SearchOptionDesc=Class+Subject&MainCampusInd=. (It works fine if you access it in the browser.)

So I fetch it with cURL, using this code:

function curl_classes($url){
  $ch = curl_init();
  $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
  curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
  // Relative cookie paths are resolved against the current working directory
  curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
  curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
  curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
  curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
  echo "NOW IM REALLY GOING TO: " . $url;
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_FAILONERROR, true);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
  curl_setopt($ch, CURLOPT_AUTOREFERER, true);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_TIMEOUT, 50);
  // Note: this overrides the user agent set above
  curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);

  $html = curl_exec($ch);
  if (!$html) {
    // Read the error details before closing the handle
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    curl_close($ch);
    exit;
  }
  curl_close($ch);
  echo htmlspecialchars($html);
}

EDIT

Okay, new problem. My cookie-storing code doesn't seem to be working. I'm able to scrape this link as desired: bu[DOT]edu/link/bin/uiscgi_studentlink/1293357973?ModuleName=univschr.pl&SearchOptionDesc=Class+Subject&SearchOptionCd=C&KeySem=20114&ViewSem=Spring+2011&Subject=ACCT&MtgDay=&MtgTime=

But when I try to scrape the link at the top of this post I get: "Sorry you need cookies enabled..."

What am I doing wrong in my cookie storing code?

3 Answers

2

I'm betting that you do access the HTML. It prints the HTML to the screen, and that HTML includes code that redirects you to a new page.

Try outputting an encoded version of the HTML, so that the browser interprets it as plain text:

echo htmlspecialchars($html);

However, looking at your actual code: please do not pretend to be Google. You are not the Googlebot, so your script should not say that you are. If you include any user agent at all (and I recommend that you do), make it reflect your identity, in case the site owner hits issues with your bot. No need to be shady :)
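As a rough illustration of an honest user agent, here is a minimal sketch. The helper name `bot_user_agent`, the bot name, and the contact URL are all made up for the example; substitute whatever actually identifies you.

```php
<?php
// Hypothetical helper: build a user-agent string that names the bot
// and gives the site owner a way to reach its operator.
function bot_user_agent(string $name, string $version, string $contact): string {
    return sprintf('%s/%s (+%s)', $name, $version, $contact);
}

// Example (placeholder values): pass the result to cURL instead of a Googlebot string.
// curl_setopt($ch, CURLOPT_USERAGENT,
//     bot_user_agent('ClassScraper', '0.1', 'http://example.com/contact'));
```

The name/version/contact layout is only a common convention, not a requirement; the point is that the string says who you are.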


2 Comments

Hmm. That fixed the redirect but now it's loading a page for non-cookie browsers without the class info. I'd appreciate it if you could read the post edit.
@Pauly: Some users report that they need to use absolute paths for their COOKIEJAR and COOKIEFILE entries to work :/
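To make the absolute-path suggestion concrete, here is a minimal sketch, assuming the cookie file should sit in the same directory as the script. `cookie_file_path` and `make_cookie_handle` are hypothetical helper names, not part of the original code, and there is no guarantee this alone fixes the "cookies enabled" message.

```php
<?php
// Hypothetical helper: absolute path to a cookie file next to this script.
function cookie_file_path(string $name = 'cookie.txt'): string {
    return __DIR__ . DIRECTORY_SEPARATOR . $name;
}

// Hypothetical helper: a cURL handle that reads cookies from and
// writes cookies to the same absolute file.
function make_cookie_handle(string $url, string $cookieFile) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // cookies sent with requests
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // cookies saved for later requests
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    return $ch;
}
```

The directory has to be writable by the PHP process, otherwise the jar is never written; `is_writable(dirname(cookie_file_path()))` is a quick way to check.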
1

Since you're echoing the contents out in the browser, any JavaScript in the remote page will be executed. Presumably something is redirecting the page.

2 Comments

Is there a way to disable the JavaScript when echoing? Otherwise, how can I see what's in $html if not with echo?
Simply echo the HTML inside a textarea :-)
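A minimal sketch of the textarea idea, with a hypothetical helper name `html_in_textarea`. The content is still passed through `htmlspecialchars`, since a literal `</textarea>` in the fetched page would otherwise close the box early and let the rest of the markup run.

```php
<?php
// Hypothetical helper: wrap fetched HTML in a read-only textarea so the
// browser displays it as text instead of rendering or executing it.
function html_in_textarea(string $html): string {
    return '<textarea rows="30" cols="120" readonly>'
        . htmlspecialchars($html)
        . '</textarea>';
}

// Example: echo html_in_textarea($html);
```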
0

You can write the HTML into a file and then open that in an editor if the page has annoying JavaScript. Or just disable JS in your browser.
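Writing to a file could look something like this; `dump_html` is a hypothetical helper name and the output path is whatever you choose.

```php
<?php
// Hypothetical helper: save fetched HTML to a file for inspection in an
// editor. Returns the path that was written.
function dump_html(string $html, string $path): string {
    if (file_put_contents($path, $html) === false) {
        throw new RuntimeException("Could not write to $path");
    }
    return $path;
}

// Example: dump_html($html, __DIR__ . '/response.html');
```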
