I usually find my answers by searching the web and Stack Overflow, but this time I couldn't resolve my issue. I'm using PHP DOM to parse a website and extract some data from it, but for some reason every approach I've tried returns fewer items than there are on the page.

I tried "PHP Simple HTML DOM", "PHP Advanced HTML DOM", and the native PHP DOM, but in this case I still get only 14 article tags.

http://www.emol.com/movil/nacional/

This site has 28 elements tagged "article", but I always get 14 (or fewer).

I tried the classic find (from Simple and Advanced) with every combination I could think of, and with the native DOM I tried XPath queries and getElementsByTagName.

$xpath->query('//article');
$xpath->query('//*[@id="listNews"]/article[6]'); // even this doesn't work
$html->find('article:not(.sec_mas_vistas_emol), article'); // returns 14

So my guess was that the problem was in how I was loading the URL, so I tried the classic file_get_html, cURL, and some custom functions, and they all give the same result. What is stranger is that if I copy all the HTML into an online XPath tester and run the query '//article', it finds them all. These are my two latest tests:

//Way 1
$html = file_get_html('http://www.emol.com/movil/nacional/');
$lidata = $html->find('article');

//Way 2
$url = 'http://www.emol.com/movil/nacional';
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$e = curl_exec($ch);
$dom = new DOMDocument;
@$dom->loadHTML($e); // also tried loadHTMLFile and libxml_use_internal_errors
$xpath = new DOMXPath($dom);
$xpath->query('//article');

Any suggestions on what the issue could be and how to fix it? This is my first attempt at using PHP DOM, so it's possible I'm missing something.

  • The provided link contains only 14 article elements. Commented Jan 6, 2018 at 17:11
  • I agree with @marcell. There are only 14 articles on that page. Commented Jan 6, 2018 at 17:15
  • No, there are 30. Check in the inspector; the easy way is to search for "<article". The first area has 8, the second has 10 (ultimo minuto), and the third has 10 (noticias más vistas), plus 2 article elements that are containers. Each news item is an article element. I checked again in case I was wrong, but I still see 30. Commented Jan 6, 2018 at 17:15
  • No, if you view the source of that page there are 14 articles. That is what you get when you fetch the page from PHP, and that's why you get only 14 articles. I tried it myself. Commented Jan 6, 2018 at 17:23
  • You can use a headless browser to fetch the dynamic data. There is a PHP wrapper for the CasperJS library, a navigation scripting and testing utility for the PhantomJS (WebKit) and SlimerJS (Gecko) headless browsers, written in JavaScript. Commented Jan 6, 2018 at 18:01
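
The disagreement in these comments (14 vs 30) comes down to the HTML the server sends versus the DOM the browser shows after scripts have run. One way to check which of the two PHP is seeing is to compare a plain-text count of opening article tags with the count the parser reports. The following is only a minimal sketch: countArticleTags is a hypothetical helper name, not something from the thread, and the inline sample stands in for the fetched page. If both numbers agree, the parser is probably not dropping anything, and the missing elements were likely never in the response.

```php
<?php
// Hypothetical helper (not from the thread): compares a plain-text count of
// opening <article> tags with the count DOMDocument finds after parsing.
function countArticleTags(string $html): array
{
    // Count opening tags directly in the raw string, case-insensitively.
    $raw = preg_match_all('/<article[\s>]/i', $html);

    // Parse with libxml, collecting warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $parsed = $xpath->query('//article')->length;

    return ['raw' => $raw, 'parsed' => $parsed];
}

// Small inline sample so the sketch runs without network access.
$sample = '<html><body><div id="listNews">'
    . '<article id="a1">One</article><article id="a2">Two</article>'
    . '</div></body></html>';
print_r(countArticleTags($sample));
```

To apply it to the real page, pass in the string returned by file_get_contents or curl_exec instead of $sample.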

1 Answer

Maybe my comment above and this example can help you proceed.

With the phpcasperjs wrapper:

<?php 

require_once 'vendor/autoload.php';

use Browser\Casper;

$casper = new Casper();
$casper->start('http://www.emol.com/movil/nacional/');
$casper->wait(5000);
$output = $casper->getOutput();
$casper->run();
$html = $casper->getHtml();
$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$cnt = 1;
foreach ($xpath->query('//article') as $article) {
    print $cnt . ' - ' . $article->nodeName . ' - ' . $article->getAttribute('id') . "\n";
    $cnt += 1;
}

With file_get_contents, as you tried before:

<?php

$html = file_get_contents('http://www.emol.com/movil/nacional/');
$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$cnt = 1;
foreach ($xpath->query('//article') as $article) {
    print $cnt . ' - ' . $article->nodeName . ' - ' . $article->getAttribute('id') . "\n";
    $cnt += 1;
}

The first script counts 30 article elements (with phpcasperjs), while the second counts 14 (with file_get_contents).
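
To see which article elements appear only after the page has been rendered, the ids from the two fetches can be compared. This is a rough sketch under stated assumptions: articleIds is a hypothetical helper, and the two inline strings are made-up stand-ins for the HTML returned by the headless browser and by file_get_contents.

```php
<?php
// Hypothetical helper: returns the id attribute of every <article> in $html.
function articleIds(string $html): array
{
    // Collect libxml warnings quietly instead of using the @ operator.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $ids = [];
    foreach ((new DOMXPath($dom))->query('//article') as $article) {
        $ids[] = $article->getAttribute('id');
    }
    return $ids;
}

// Made-up stand-ins for the two fetches; replace with the real strings.
$staticHtml = '<div><article id="n1"></article></div>';
$renderedHtml = '<div><article id="n1"></article><article id="n2"></article></div>';

// Ids present in the rendered DOM but absent from the served HTML.
$onlyRendered = array_values(array_diff(articleIds($renderedHtml), articleIds($staticHtml)));
print_r($onlyRendered);
```

Articles without an id attribute all come back as empty strings, so on a real page a different key (for example a link inside each article) may be a better basis for the comparison.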

2 Comments

Thanks a lot, but I'm not sure it will work where I want to implement it. Anyway, it's a good guide for me to continue.
You are welcome. Also note that before you experiment with the script above, you have to install PhantomJS and CasperJS: npm install -g phantomjs, npm install -g casperjs.
