PHP DOM html issue with a certain tag

Question

Hi all. I usually find my answers by searching the web and Stack Overflow, but this time I couldn't resolve my issue. I'm using PHP DOM to parse a website and extract some data from it, but every approach I have tried returns fewer items than the number on the page.

I tried "PHP Simple HTML DOM", "PHP Advanced HTML DOM", and the native PHP DOM, but in this case I still get only 14 article tags.

http://www.emol.com/movil/nacional/

On this site there are 28 elements tagged "article", but I always get 14 (or fewer).

I tried the classic find (from the Simple and Advanced libraries) with every combination I could think of, and with the native DOM I tried XPath queries and getElementsByTagName.

$xpath->query('//article');
$xpath->query('//*[@id="listNews"]/article[6]'); // even this doesn't work
$html->find('article:not(.sec_mas_vistas_emol), article'); // returns 14

So my guess was that the problem was in how I was loading the URL, so I tried the classic file_get_html, cURL, and some custom functions, and they all give the same result. What is stranger is that if I copy all the HTML into an online XPath tester and run the query '//article', it finds all of them. These are my two latest tests:

//Way 1
$html = file_get_html('http://www.emol.com/movil/nacional/');
$lidata = $html->find('article');

//Way 2
$url = 'http://www.emol.com/movil/nacional';
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$e = curl_exec($ch);
$dom = new DOMDocument;
@$dom->loadHTML($e); // also tried loadHTMLFile and libxml_use_internal_errors
$xpath = new DOMXPath($dom);
$xpath->query('//article');

Any suggestions on what the issue could be and how to fix it? This is my first attempt at using PHP DOM, so it's possible there is something I'm missing.

The provided link contains only 14 article elements.
– marcell, Commented Jan 6, 2018 at 17:11
I agree with @marcell. There are only 14 articles on that page.
– Andreas, Commented Jan 6, 2018 at 17:15
No, there are 30. Check in the inspector; the easy way is to search for <article. The first area has 8, the second has 10 (ultimo minuto), and the third has 10 (noticias más vistas). Two more article elements are containers. Each news item is an article element. I checked again in case I was wrong, but I see 30.
– Rodrigo Aliaga, Commented Jan 6, 2018 at 17:15
No, if you view the source of that page there are 14 articles. That is what you get when you fetch the page from PHP, and that's why you get only 14 articles. I tried it myself.
– Andreas, Commented Jan 6, 2018 at 17:23
You can use a headless browser to fetch the dynamic data. There is a PHP wrapper for the CasperJS library, a navigation scripting and testing utility for the PhantomJS (WebKit) and SlimerJS (Gecko) headless browsers, written in JavaScript.
– marcell, Commented Jan 6, 2018 at 18:01

marcell · Accepted Answer · 2018-01-06 18:06:43Z

Maybe my comment above and this example can help you to proceed.

With phpcasperjs wrapper:

<?php 

require_once 'vendor/autoload.php';

use Browser\Casper;

$casper = new Casper();
$casper->start('http://www.emol.com/movil/nacional/');
$casper->wait(5000);
$output = $casper->getOutput();
$casper->run();
$html = $casper->getHtml();
$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$cnt = 1;
foreach ($xpath->query('//article') as $article) {
    print $cnt . ' - ' . $article->nodeName . ' - ' . $article->getAttribute('id') . "\n";
    $cnt += 1;
}

With file_get_contents as you tried before:

<?php

$html = file_get_contents('http://www.emol.com/movil/nacional/');
$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$cnt = 1;
foreach ($xpath->query('//article') as $article) {
    print $cnt . ' - ' . $article->nodeName . ' - ' . $article->getAttribute('id') . "\n";
    $cnt += 1;
}

The scripts count 30 article elements with phpcasperjs versus 14 with file_get_contents.
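
The difference comes from JavaScript: DOMDocument only parses the markup it is given and never runs scripts, so elements that the browser adds after loading are not in the fetched source. Here is a minimal sketch of that effect, using a made-up HTML string rather than the real emol.com page. The helper name countArticles and the sample markup are assumptions for illustration only.

```php
<?php

// Hypothetical helper: count <article> elements in an HTML string.
function countArticles(string $html): int
{
    // Collect libxml parse warnings (e.g. for HTML5 tags) instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    return $xpath->query('//article')->length;
}

// Made-up page: two articles exist in the markup; a third would only be
// created by the script when a browser executes it.
$staticHtml = <<<'HTML'
<!DOCTYPE html>
<html>
<body>
<div id="listNews">
  <article id="a1">First</article>
  <article id="a2">Second</article>
</div>
<script>
  var extra = document.createElement('article');
  extra.id = 'a3';
  document.getElementById('listNews').appendChild(extra);
</script>
</body>
</html>
HTML;

// DOMDocument sees only the two articles present in the source.
echo countArticles($staticHtml), "\n"; // prints 2
```

A browser inspector would show three article elements for this markup, which mirrors the 30 versus 14 gap above. A headless browser closes that gap because it executes the script before the HTML is read.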


2 Comments

Rodrigo Aliaga Over a year ago

Thanks a lot, but I'm not sure it is going to work where I want to implement it. Anyway, it's a good guide for me to continue.

marcell Over a year ago

You are welcome. Also note that before you experiment with the script above, you have to install PhantomJS and CasperJS: npm install -g phantomjs, npm install -g casperjs.
