5

I'm trying the following:

$url = 'https://www.tripadvisor.es/Hotels-g187514-Madrid-Hotels.html';

$ta_html = file_get_html($url);
var_dump($ta_html);

It returns false. However, the same code works and correctly retrieves the HTML for:

$url = 'https://www.tripadvisor.es/Hotels-g294316-Lima_Lima_Region-Hotels.html#ACCOM_OVERVIEW';

My first thought was that there was a redirect, but I checked the headers with curl and it's 200 OK, and they looked the same in both cases. What can be happening? How can it be solved?

This seems to be a duplicate of this problem: Simple HTML DOM returning false, which is also unanswered.

2 Comments
  • What are you trying to scrape from that page? I prefer to use PHP's built-in DOMDocument class (see the sketch after these comments). Commented Apr 22, 2017 at 17:15
  • I'm just experimenting with Simple HTML DOM Parser. But I'd like to know why, on the same website, of two seemingly identical URLs one works and the other doesn't. Commented Apr 22, 2017 at 17:17
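For reference, a minimal sketch of the DOMDocument approach mentioned in the first comment; the URL is the one from the question and the XPath query is just an illustration:

$html = file_get_contents('https://www.tripadvisor.es/Hotels-g187514-Madrid-Hotels.html');

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely well-formed, so suppress parse warnings
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
echo $xpath->query('//title')->item(0)->textContent; // e.g. print the page title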

5 Answers 5

15

It looks like the Simple HTML DOM parser is failing because the HTML file is larger than the library's maximum file size. When you call file_get_html(), it performs a file size check against its MAX_FILE_SIZE constant. So before calling any Simple HTML DOM methods, increase the maximum file size used by the library by calling:

define('MAX_FILE_SIZE', 1200000); // or larger if needed, default is 600000
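
A minimal end-to-end sketch of that fix, assuming the standard simple_html_dom.php file; note that most versions of the library only define MAX_FILE_SIZE when it isn't already set, so the define() has to run before the library is included:

define('MAX_FILE_SIZE', 1200000); // must come before the include, or the library's default (600000) wins

require_once 'simple_html_dom.php';

$ta_html = file_get_html('https://www.tripadvisor.es/Hotels-g187514-Madrid-Hotels.html');

if ($ta_html === false) {
    exit('file_get_html() still returned false; try a larger MAX_FILE_SIZE');
}

echo $ta_html->find('title', 0)->plaintext; // sanity check: print the page title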

Also, as you found out, you can work around the file size check by doing this ($str here is the raw HTML string, fetched for example with cURL as in the answer below):

$html = new simple_html_dom();
$html->load($str); // load() skips the MAX_FILE_SIZE check that file_get_html() performs


2

So I found a workaround by doing this:

$base = $url;
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); // skip SSL certificate verification (convenient, but insecure)
curl_setopt($curl, CURLOPT_HEADER, false);         // don't include response headers in the output
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
curl_setopt($curl, CURLOPT_URL, $base);
curl_setopt($curl, CURLOPT_REFERER, $base);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);  // return the body as a string instead of printing it
$str = curl_exec($curl);
curl_close($curl);

$html = new simple_html_dom();
$html->load($str);

Truth be told, I don't know exactly why this works or what the original problem was, and I would appreciate it if anyone could point that out.


0

It looks like this is happening because of this check inside the file_get_html() function in simple_html_dom.php:

if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
{
    return false;
}

It might be that the length of the content is greater than MAX_FILE_SIZE.
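
A quick sketch to confirm this, assuming the library's default MAX_FILE_SIZE of 600000 mentioned above:

$contents = file_get_contents('https://www.tripadvisor.es/Hotels-g187514-Madrid-Hotels.html');
echo strlen($contents); // anything above 600000 makes file_get_html() return false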


0

Hope this will help you:

$base = $url;
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $base);
curl_setopt($curl, CURLOPT_REFERER, $base);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$str = curl_exec($curl);
curl_close($curl);

$html = new simple_html_dom();
$html->load($str);


-1

Use file_get_contents() instead; it works for me.

$url = "https://www.tripadvisor.es/Hotels-g187514-Madrid-Hotels.html";
file_put_contents("hello.html", file_get_contents($url));

file_get_html("Hello_html");

4 Comments

The OP wrote that it works for another URL. This isn't the answer, nor the correct solution.
The URL I used in the example works; don't talk shit when you didn't test it.
This works, but I need to use file_get_html from simplehtmldom.sourceforge.net. Not sure if my question is not well written.
Check my answer again.
