I'm using this example code as a starting point for parsing a particular website:

<?php

# Use the Curl extension to query Google and get back a page of results
$url = "http://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);

# Create a DOM parser object
$dom = new DOMDocument();

# Parse the HTML from Google.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);

# Iterate over all the <a> tags
foreach($dom->getElementsByTagName('a') as $link) {
        # Show the <a href>
        echo $link->getAttribute('href');
        echo "<br />";
}
?>

Source

Then I changed the URL above to my own site (removed for privacy reasons) and ran the script again, but this time I got no output, although it works with the Google URL. So what is the problem with my website? Are there protection mechanisms that prevent parsing, or does the page not conform to the standard? I hope someone can help me.

  • Try outputting the HTML and see what it returns. Also take a look at the HTTP response headers. With that said, in all likelihood if the URL works in your browser and not in curl, it's probably because it rejects requests with no user agent set. I've seen this before a few times. Commented Dec 16, 2018 at 0:00
  • Is your curl extension enabled? I can retrieve links using your code. Commented Dec 16, 2018 at 1:45
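
The first comment's suggestion (look at what actually came back before parsing it) can be sketched as below. This is only a minimal illustration: `describe_body` is a hypothetical helper name, not part of the question's code, and checking for the gzip magic bytes is just one plausible thing to look for in a body that DOMDocument cannot parse.

```php
<?php
// Hypothetical helper (not from the original question): summarise a raw
// response body so it is obvious what curl returned before it is parsed.
function describe_body(string $body): string
{
    if ($body === '') {
        return 'empty body';
    }
    // A gzip stream always starts with the two bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        return 'gzip-compressed body, ' . strlen($body) . ' bytes';
    }
    return 'plain body, ' . strlen($body) . ' bytes';
}

// Possible usage inside the question's script, before curl_close($ch):
//   echo $html === false ? curl_error($ch) : describe_body($html);
//   echo ' / HTTP status: ' . curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
```

If the summary reports a compressed body, the HTML parser was handed binary data rather than markup, which would explain an empty list of links.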

1 Answer

It looks like that site returns only gzip-encoded responses, so cURL has to be told to decode gzip and to send a matching Accept-Encoding header:

$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_ENCODING , "gzip");
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'Accept-Encoding: gzip, deflate, br',
));
$html = curl_exec($ch);
curl_close($ch);

This is working on my end.
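
As a variation on the snippet above, here is a minimal sketch that lets cURL negotiate the compression itself. Passing an empty string to `CURLOPT_ENCODING` makes cURL send an Accept-Encoding header listing every encoding the local build supports and decode the response accordingly, so no hand-written header is needed. The function names `fetch_html` and `extract_links` are hypothetical, introduced only for illustration, and the error handling is an assumption about what you might want rather than part of the original answer.

```php
<?php
// Hypothetical helper: fetch a page and let cURL handle content encoding.
function fetch_html(string $url, int $timeout = 5): string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    // Empty string: advertise all encodings this cURL build supports
    // and transparently decode whichever one the server picks.
    curl_setopt($ch, CURLOPT_ENCODING, '');
    $html = curl_exec($ch);
    if ($html === false) {
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException("curl failed: $error");
    }
    curl_close($ch);
    return $html;
}

// Hypothetical helper: collect the href of every <a> tag in an HTML string.
function extract_links(string $html): array
{
    $links = [];
    if ($html === '') {
        return $links; // loadHTML() rejects an empty string on PHP 8
    }
    $dom = new DOMDocument();
    // @ suppresses warnings caused by invalid markup, as in the question.
    @$dom->loadHTML($html);
    foreach ($dom->getElementsByTagName('a') as $link) {
        $links[] = $link->getAttribute('href');
    }
    return $links;
}
```

With these in place, the question's loop becomes `foreach (extract_links(fetch_html($url)) as $href) { echo $href, "<br />"; }`. If the site also rejects requests without a user agent, as suggested in the comments, setting `CURLOPT_USERAGENT` inside `fetch_html` would be the place to try that.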
