
I'm trying to get all the news link URLs from a certain div on this web page.

When I view the page source, the links aren't there, even though they do appear in the rendered page.

Could anyone who understands PHP, arrays, and JS help me, please?

This is my code to get the content:

$html = file_get_contents("https://qc.yahoo.com/");
if ($html === FALSE) {
    die("?");
}
echo $html;
2 Comments
  • I'm having a hard time understanding. It would help if you showed us a sample $html input, and what you would like to have when you're done processing. Just a small sample, enough that we understand what you're trying to do. Commented Jul 15, 2016 at 9:15
  • Hi @BeetleJuice, have you checked stackoverflow.com/a/38396700/6516181? That's what I mean. Sorry, I'm not advanced in coding or the names of keywords. Please help ^^ Commented Jul 16, 2016 at 2:15

3 Answers

$html = new DOMDocument();
@$html->loadHtmlFile('https://qc.yahoo.com/');
$xpath = new DOMXPath($html);
$nodelist = $xpath->query("//div[@id='news_moreTopStories']//a/@href");
foreach ($nodelist as $n) {
    echo $n->nodeValue . "\n";
}

You can get all the links from the divs you specify. Make sure you put the div's id in [@id='news_moreTopStories']. You're using XPath to query the divs, so you don't need a ton of code — just this portion.

http://php.net/manual/en/class.domxpath.php
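As a sketch of the same XPath technique, here it is run against an inline HTML string instead of the live page (the markup and link URLs are made up for illustration), so you can try it without a network request:

```php
<?php
// Hypothetical markup standing in for the downloaded page.
$source = '<div id="news_moreTopStories">'
        . '<a href="/story-1">One</a>'
        . '<a href="/story-2">Two</a>'
        . '</div>';

$doc = new DOMDocument();
@$doc->loadHTML($source);  // @ suppresses warnings about imperfect HTML
$xpath = new DOMXPath($doc);

// Same query shape as above: every href attribute inside the target div.
foreach ($xpath->query("//div[@id='news_moreTopStories']//a/@href") as $attr) {
    echo $attr->nodeValue . "\n"; // prints /story-1 then /story-2
}
```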


2 Comments

Hi sir, thank you for helping us too; this adds more solutions for me ^^
Yes, this is a better solution, but it doesn't seem to decode the gzip-ed content.

Assuming you want to extract all anchor tags with their hyperlinks from the given page.

Now there are certain problems with doing file_get_contents on that URL:

  1. Content encoding: the response is compressed with gzip.
  2. SSL verification of the URL.

So, to overcome the first problem (gzip encoding), we'll use cURL as @gregn3 suggested in his answer. But he missed using cURL's ability to automatically decompress gzipped content.

For the second problem, you can either follow this guide or disable SSL verification via cURL's curl_setopt options (note that disabling verification defeats certificate checking, so only do it for testing).
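If you'd rather keep verification enabled, a sketch of the safer alternative is to point cURL at a CA certificate bundle; the bundle path below is an assumption and varies by system:

```php
<?php
$c = curl_init("https://qc.yahoo.com/");
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
// Keep SSL verification on instead of disabling it.
curl_setopt($c, CURLOPT_SSL_VERIFYPEER, 1);
curl_setopt($c, CURLOPT_SSL_VERIFYHOST, 2);
// Example CA bundle path (Debian/Ubuntu); adjust for your system.
curl_setopt($c, CURLOPT_CAINFO, '/etc/ssl/certs/ca-certificates.crt');
$content = curl_exec($c);
curl_close($c);
```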

Now the code which will extract all the links from the given page is :

<?php

$url = "https://qc.yahoo.com/";

# download resource
$c = curl_init($url);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($c, CURLOPT_ENCODING, "gzip");  # sends Accept-Encoding: gzip and auto-decompresses the response
curl_setopt($c, CURLOPT_VERBOSE, 1);        # debug output; remove in production
curl_setopt($c, CURLOPT_SSL_VERIFYPEER, 0); # disables certificate checks; testing only
curl_setopt($c, CURLOPT_SSL_VERIFYHOST, 0);
$content = curl_exec($c);

curl_close($c);

# find links (the matches land in $matches[1]; the return value is only the match count)
preg_match_all("/href=\"([^\"]+)\"/i", $content, $matches);

# output results
echo "url = " . htmlspecialchars($url) . "<br>";
echo "links found (" . count($matches[1]) . "):" . "<br>";
$n = 0;
foreach ($matches[1] as $link) {
    $n++;
    echo "$n: " . htmlspecialchars($link) . "<br>";
}

But if you want to do advanced HTML parsing, then you'll need to use PHP Simple HTML DOM Parser. In PHP Simple HTML DOM you can select the div using jQuery-style selectors and fetch the anchor tags. Here are its documentation & API manual.
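A sketch of that approach, assuming the library's simple_html_dom.php file has been downloaded locally (the inline markup stands in for the live page, and the div id is taken from the accepted answer):

```php
<?php
// Requires PHP Simple HTML DOM Parser (simple_html_dom.php), downloaded separately.
include 'simple_html_dom.php';

// Hypothetical markup standing in for the downloaded page.
$html = str_get_html(
    '<div id="news_moreTopStories"><a href="/story-1">One</a><a href="/story-2">Two</a></div>'
);

// jQuery-style selector: anchors inside the target div.
foreach ($html->find('div#news_moreTopStories a') as $a) {
    echo $a->href . "\n";
}
$html->clear(); // free the memory used by the parser
```

For a live page you would use file_get_html($url) in place of str_get_html().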

11 Comments

Thanks @Deepak, I was not very familiar with cURL, but now I know about this too. :)
No, I like this; it makes me understand more. Thank you for the description & knowledge, sir :* kiss hug .. #awesome Btw, what socmed do you have? I want to add you, sir.
:) and Sorry, I don't know what socmed is.
@DeepakChaudhary social media sir.. :3
Ahh.. :D I'm not that active on socmed.

To find all links in HTML you could use preg_match_all(). The matched URLs land in $matches[1]; the function's return value is only the number of matches.

preg_match_all ("/href=\"([^\"]+)\"/i", $content, $matches);

That URL https://qc.yahoo.com/ uses gzip compression, so you have to detect that and decompress it using the function gzdecode(). (It requires the zlib extension, which must be enabled in your PHP build.)

The gzip compression is indicated by the Content-Encoding: gzip HTTP header. You have to check that header, so you must use curl or a similar method to retrieve the headers. (file_get_contents() will not give you the HTTP headers... it only downloads the gzip compressed content. You need to detect that it is compressed but for that you need to read the headers.)
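As a quick sanity check of gzdecode(), here is a round-trip on a made-up string (assuming the zlib extension is loaded); gzencode() produces the same gzip format a server would send:

```php
<?php
// Round-trip a made-up string through gzip compression.
$original   = '<a href="/story-1">One</a>';
$compressed = gzencode($original);   // gzip-format bytes, as a server would send
$restored   = gzdecode($compressed); // reverses the compression

echo $restored . "\n"; // prints the original markup
```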

Here is a complete example:

<?php

$url = "https://qc.yahoo.com/";

# download resource
$c = curl_init ($url);
curl_setopt ($c, CURLOPT_HEADER, true);
curl_setopt ($c, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec ($c);
$hsize = curl_getinfo ($c, CURLINFO_HEADER_SIZE);
curl_close ($c);

# separate headers from content
$headers = substr ($content, 0, $hsize);
$content = substr ($content, $hsize);

# check if content is compressed with gzip
$gzip = 0;
$headers = preg_split ('/\r?\n/', $headers);
foreach ($headers as $h)
{
    $pieces = preg_split ("/:/", $h, 2);
    $pieces2 = (count ($pieces) > 1);
    $enc = $pieces2 && (preg_match ("/content-encoding/i", $pieces[0]) );
    $gz = $pieces2 && (preg_match ("/gzip/i", $pieces[1]) );
    if ($enc && $gz)
    {
        $gzip = 1;
        break;
    }
}

# unzip content if gzipped
if ($gzip)
{
    $content = gzdecode ($content);
}


# find links
preg_match_all ("/href=\"([^\"]+)\"/i", $content, $matches);

# output results
echo "url = " . htmlspecialchars ($url) . "<br>";
echo "links found (" . count ($matches[1]) . "):" . "<br>";
$n = 0;
foreach ($matches[1] as $link)
{
    $n++;
    echo "$n: " . htmlspecialchars ($link) . "<br>";
}

11 Comments

Hi @gregn3, thank you for understanding my post even though I didn't know the keywords. After I use your code I get an error. I checked my PHP 5.6.23: gzdecode OK, zlib extension loaded, but "PHP Fatal error: Call to undefined function gzip_inflate()" is generated. Why? Please help.
Btw, sorry, I wanted to give an upvote, but "Votes cast by those with less than 15 reputation are recorded, but do not change the publicly displayed post score" #myreputation is bad T.T
For example, if I open the original site there are 10 links, but when I curl the site it displays only 5 links. How do I display all the links?
@ane Hi ane, to get all links on the page you could try to tweak the regex used. Maybe this is not matching all of them: "/href=\"([^\"]+)\"/i"
Then adding the curl option curl_setopt($c, CURLOPT_ENCODING, "gzip"); will do the task. After that, curl itself will decompress the response.
