
I'm trying to scrape data from several links which I retrieve from an XML file. However, I keep getting an error which only seems to appear on some of the news items. Below is the output I get:

http://www.hltv.org/news/14971-rgn-pro-series-groups-drawnRGN Pro Series groups drawn

http://www.hltv.org/news/14969-k1ck-reveal-new-teamk1ck reveal new team

http://www.hltv.org/news/14968-world-championships-captains-unveiled
Fatal error: Call to a member function find() on a non-object in  /app/scrape.php on line 266

This is line 266:

$hltv_full_text = $hltv_deep_link->find("//div[@class='rNewsContent']", 0);

Full code

Scrape function

function scrape_hltv() {
    $hltv = "http://www.hltv.org/news.rss.php";
    $sxml = simplexml_load_file($hltv);
    global $con;
    foreach ($sxml->channel->item as $item) {
        $hltv_title = (string)$item->title;
        $hltv_link = (string)$item->link;
        $hltv_date = date('Y-m-d H:i:s', strtotime((string)$item->pubDate));
        echo $hltv_link;

        //if (date('Y-m-d', strtotime((string)$item->pubDate)) == date('Y-m-d')) {
            if (strpos($hltv_title, 'Video:') === false) {
                $hltv_deep_link = file_get_html($hltv_link);
                $hltv_full_text = $hltv_deep_link->find("//div[@class='rNewsContent']", 0);

                echo $hltv_title . '<br><br>';
            }
        //}
    }
}

scrape_hltv();
  • What is your question? Commented May 15, 2015 at 13:43
  • Why does it return the error: Fatal error: Call to a member function find() on a non-object? Commented May 15, 2015 at 13:52

1 Answer


There are several situations in which file_get_html() returns false.

See the source code here: http://sourceforge.net/p/simplehtmldom/code/HEAD/tree/trunk/simple_html_dom.php#l79

if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
{
    return false;
}

For your link

http://www.hltv.org/news/14968-world-championships-captains-unveiled

I think it is because the content of the page is larger than MAX_FILE_SIZE (600,000 bytes). The page is actually around 3 MB.
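Regardless of the size limit, the fatal error itself comes from calling ->find() on the false that file_get_html() returned, so the scrape loop should guard the return value before touching it. A minimal, self-contained sketch of that guard (load_page() here is a hypothetical stand-in for file_get_html() that simulates the oversized-page failure, so the sketch runs on its own):

```php
<?php
// Sketch: guard against a failed load before calling any method on the result.
// load_page() is a stand-in for simple_html_dom's file_get_html(): it returns
// false for "oversized" pages, mimicking the failure described above.
function load_page(string $url)
{
    return strlen($url) > 30 ? false : new DOMDocument();
}

$links = [
    'short-url',
    'http://www.hltv.org/news/14968-world-championships-captains-unveiled',
];

foreach ($links as $link) {
    $page = load_page($link);
    if ($page === false) {          // guard before any method call
        echo "skipped: $link\n";
        continue;
    }
    echo "parsed: $link\n";
}
// prints:
// parsed: short-url
// skipped: http://www.hltv.org/news/14968-world-championships-captains-unveiled
```

The same `=== false` check dropped into the question's loop (before the ->find() call) turns the fatal error into a skipped item.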

If you want to process larger files as well, you can try a modified version of the function:

define('DEFAULT_TARGET_CHARSET', 'UTF-8');
define('DEFAULT_BR_TEXT', "\r\n");
define('DEFAULT_SPAN_TEXT', " ");

function file_get_html_modified($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    if (empty($contents))
    {
        return false;
    }
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}

The || strlen($contents) > MAX_FILE_SIZE check was removed.
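Alternatively, PHP's built-in DOMDocument and DOMXPath can run the same XPath query with no file-size cap, avoiding the library limit entirely. A sketch (inline HTML stands in for the fetched page; the rNewsContent class name is taken from the question):

```php
<?php
// Sketch: query //div[@class='rNewsContent'] with PHP's built-in DOM
// extension instead of simple_html_dom. Inline HTML stands in for the page.
$html = '<html><body><div class="rNewsContent">Article body</div></body></html>';

$dom = new DOMDocument();
// @ suppresses the warnings that real-world, imperfect HTML often triggers.
if (@$dom->loadHTML($html) === false) {
    echo "skip: page could not be parsed\n";
} else {
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query("//div[@class='rNewsContent']");
    if ($nodes->length === 0) {
        echo "skip: no rNewsContent div on this page\n";
    } else {
        echo $nodes->item(0)->textContent, "\n"; // prints "Article body"
    }
}
```

Note that DOMXPath requires the node-list length check shown above, since query() succeeds even when nothing matches.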


2 Comments

@PeterPik Do you want to process larger files as well? Or just don't want your code to crash?
Hmm, I just want it to process the file even though it might be bigger.
