
I'm trying to get all the news link URLs from a certain div on this web page.

When I view the page source, the links aren't there, even though they do appear in the rendered page.

Could anyone who understands PHP, arrays, and JS help me, please?

This is my code to get the content:

$html = file_get_contents("https://qc.yahoo.com/");
if ($html === FALSE) {
    die("?");
}
echo $html;
2 Comments
  • I'm having a hard time understanding. It would help if you showed us a sample $html input, and what you would like to have when you're done processing. Just a small sample, enough that we understand what you're trying to do. Commented Jul 15, 2016 at 9:15
  • Hi @BeetleJuice, have you checked stackoverflow.com/a/38396700/6516181? That's what I mean. Sorry, I'm not advanced in coding or the names of keywords. Please help ^^ Commented Jul 16, 2016 at 2:15

3 Answers

$html = new DOMDocument();
@$html->loadHtmlFile('https://qc.yahoo.com/');
$xpath = new DOMXPath($html);
$nodelist = $xpath->query("//div[@id='news_moreTopStories']//a/@href");
foreach ($nodelist as $n) {
    echo $n->nodeValue . "\n";
}

You can get all the links from the divs you specify. Make sure you put the div's id in [@id='news_moreTopStories']. You're using XPath to query the divs, so you don't need a ton of code — just this portion.

http://php.net/manual/en/class.domxpath.php
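As a sketch of the same XPath technique, here it is run against an inline HTML string instead of the live page (the markup and link URLs are made up for illustration), so you can try it without a network request:

```php
<?php
// Hypothetical markup standing in for the downloaded page.
$source = '<div id="news_moreTopStories">'
        . '<a href="/story-1">One</a>'
        . '<a href="/story-2">Two</a>'
        . '</div>';

$doc = new DOMDocument();
@$doc->loadHTML($source);  // @ suppresses warnings about imperfect HTML
$xpath = new DOMXPath($doc);

// Same query shape as above: every href attribute inside the target div.
foreach ($xpath->query("//div[@id='news_moreTopStories']//a/@href") as $attr) {
    echo $attr->nodeValue . "\n"; // prints /story-1 then /story-2
}
```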


2 Comments

Hi sir, thank you for helping us too; this adds more solutions for me ^^
Yes, this is a better solution, but it doesn't seem to decode the gzip-ed content.

Assuming you want to extract all anchor tags with their hyperlinks from the given page.

Now there are certain problems with doing file_get_contents on that URL:

  1. Content encoding: the response is compressed with gzip.
  2. SSL verification of the URL.

So, to overcome the first problem (gzip encoding), we'll use cURL as @gregn3 suggested in his answer. But he missed using cURL's ability to automatically decompress gzipped content.

For the second problem, you can either follow this guide or disable SSL verification via cURL's curl_setopt options (note that disabling verification defeats certificate checking, so only do it for testing).
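If you'd rather keep verification enabled, a sketch of the safer alternative is to point cURL at a CA certificate bundle; the bundle path below is an assumption and varies by system:

```php
<?php
$c = curl_init("https://qc.yahoo.com/");
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
// Keep SSL verification on instead of disabling it.
curl_setopt($c, CURLOPT_SSL_VERIFYPEER, 1);
curl_setopt($c, CURLOPT_SSL_VERIFYHOST, 2);
// Example CA bundle path (Debian/Ubuntu); adjust for your system.
curl_setopt($c, CURLOPT_CAINFO, '/etc/ssl/certs/ca-certificates.crt');
$content = curl_exec($c);
curl_close($c);
```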

Now the code which will extract all the links from the given page is :

<?php

$url = "https://qc.yahoo.com/";

# download resource
$c = curl_init($url);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($c, CURLOPT_ENCODING, "gzip");  # sends Accept-Encoding: gzip and auto-decompresses the response
curl_setopt($c, CURLOPT_VERBOSE, 1);        # debug output; remove in production
curl_setopt($c, CURLOPT_SSL_VERIFYPEER, 0); # disables certificate checks; testing only
curl_setopt($c, CURLOPT_SSL_VERIFYHOST, 0);
$content = curl_exec($c);

curl_close($c);

# find links (the matches land in $matches[1]; the return value is only the match count)
preg_match_all("/href=\"([^\"]+)\"/i", $content, $matches);

# output results
echo "url = " . htmlspecialchars($url) . "<br>";
echo "links found (" . count($matches[1]) . "):" . "<br>";
$n = 0;
foreach ($matches[1] as $link) {
    $n++;
    echo "$n: " . htmlspecialchars($link) . "<br>";
}

But if you want to do advanced HTML parsing, then you'll need to use PHP Simple HTML DOM Parser. In PHP Simple HTML DOM you can select the div using jQuery-style selectors and fetch the anchor tags. Here are its documentation & API manual.
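A sketch of that approach, assuming the library's simple_html_dom.php file has been downloaded locally (the inline markup stands in for the live page, and the div id is taken from the accepted answer):

```php
<?php
// Requires PHP Simple HTML DOM Parser (simple_html_dom.php), downloaded separately.
include 'simple_html_dom.php';

// Hypothetical markup standing in for the downloaded page.
$html = str_get_html(
    '<div id="news_moreTopStories"><a href="/story-1">One</a><a href="/story-2">Two</a></div>'
);

// jQuery-style selector: anchors inside the target div.
foreach ($html->find('div#news_moreTopStories a') as $a) {
    echo $a->href . "\n";
}
$html->clear(); // free the memory used by the parser
```

For a live page you would use file_get_html($url) in place of str_get_html().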

11 Comments

Thanks @Deepak, I was not very familiar with cURL, but now I know about this too. :)
No, I like this; it makes me understand more. Thank you for the description & knowledge, sir :* kiss hug .. #awesome Btw, what socmed do you have? I want to add you, sir.
:) and Sorry, I don't know what socmed is.
@DeepakChaudhary social media sir.. :3
Ahh.. :D I'm not that active on socmed.

To find all links in HTML you could use preg_match_all(). The matched URLs land in $matches[1]; the function's return value is only the number of matches.

preg_match_all ("/href=\"([^\"]+)\"/i", $content, $matches);

That URL https://qc.yahoo.com/ uses gzip compression, so you have to detect that and decompress it using the function gzdecode(). (It requires the zlib extension, which must be enabled in your PHP build.)

The gzip compression is indicated by the Content-Encoding: gzip HTTP header. You have to check that header, so you must use curl or a similar method to retrieve the headers. (file_get_contents() will not give you the HTTP headers... it only downloads the gzip compressed content. You need to detect that it is compressed but for that you need to read the headers.)
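As a quick sanity check of gzdecode(), here is a round-trip on a made-up string (assuming the zlib extension is loaded); gzencode() produces the same gzip format a server would send:

```php
<?php
// Round-trip a made-up string through gzip compression.
$original   = '<a href="/story-1">One</a>';
$compressed = gzencode($original);   // gzip-format bytes, as a server would send
$restored   = gzdecode($compressed); // reverses the compression

echo $restored . "\n"; // prints the original markup
```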

Here is a complete example:

<?php

$url = "https://qc.yahoo.com/";

# download resource
$c = curl_init ($url);
curl_setopt ($c, CURLOPT_HEADER, true);
curl_setopt ($c, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec ($c);
$hsize = curl_getinfo ($c, CURLINFO_HEADER_SIZE);
curl_close ($c);

# separate headers from content
$headers = substr ($content, 0, $hsize);
$content = substr ($content, $hsize);

# check if content is compressed with gzip
$gzip = 0;
$headers = preg_split ('/\r?\n/', $headers);
foreach ($headers as $h)
{
    $pieces = preg_split ("/:/", $h, 2);
    $pieces2 = (count ($pieces) > 1);
    $enc = $pieces2 && (preg_match ("/content-encoding/i", $pieces[0]) );
    $gz = $pieces2 && (preg_match ("/gzip/i", $pieces[1]) );
    if ($enc && $gz)
    {
        $gzip = 1;
        break;
    }
}

# unzip content if gzipped
if ($gzip)
{
    $content = gzdecode ($content);
}


# find links
preg_match_all ("/href=\"([^\"]+)\"/i", $content, $matches);

# output results
echo "url = " . htmlspecialchars ($url) . "<br>";
echo "links found (" . count ($matches[1]) . "):" . "<br>";
$n = 0;
foreach ($matches[1] as $link)
{
    $n++;
    echo "$n: " . htmlspecialchars ($link) . "<br>";
}

11 Comments

Hi @gregn3, thank you for understanding my post even though I didn't know the keywords. After I use your code I get an error. I checked my PHP 5.6.23: gzdecode OK, zlib extension loaded, but "PHP Fatal error: Call to undefined function gzip_inflate()" is generated. Why? Please help.
Btw, sorry, I wanted to give an upvote, but "Votes cast by those with less than 15 reputation are recorded, but do not change the publicly displayed post score" #myreputation is bad T.T
For example, if I open the original site there are 10 links, but when I curl the site it displays only 5 links. How do I display all the links?
@ane Hi ane, to get all links on the page you could try to tweak the regex used. Maybe this is not matching all of them: "/href=\"([^\"]+)\"/i"
Then adding the curl option curl_setopt($c, CURLOPT_ENCODING, "gzip"); will do the task. After that, curl itself will decompress the response.
