
I'm trying to scrape data from several links which I retrieve from an XML file. However, I keep getting an error which only seems to appear on some of the news items. Below is the output I get:

http://www.hltv.org/news/14971-rgn-pro-series-groups-drawnRGN Pro Series groups drawn

http://www.hltv.org/news/14969-k1ck-reveal-new-teamk1ck reveal new team

http://www.hltv.org/news/14968-world-championships-captains-unveiled
Fatal error: Call to a member function find() on a non-object in  /app/scrape.php on line 266

This is line 266:

$hltv_full_text = $hltv_deep_link->find("//div[@class='rNewsContent']", 0);

Full code

Scrape function

function scrape_hltv() {
    $hltv = "http://www.hltv.org/news.rss.php";
    $sxml = simplexml_load_file($hltv);
    global $con;
    foreach ($sxml->channel->item as $item) {
        $hltv_title = (string)$item->title;
        $hltv_link = (string)$item->link;
        $hltv_date = date('Y-m-d H:i:s', strtotime((string)$item->pubDate));
        echo $hltv_link;

        //if (date('Y-m-d', strtotime((string)$item->pubDate)) == date('Y-m-d')) {
            if (strpos($hltv_title, 'Video:') === false) {
                $hltv_deep_link = file_get_html($hltv_link);
                $hltv_full_text = $hltv_deep_link->find("//div[@class='rNewsContent']", 0);

                echo $hltv_title . '<br><br>';
            }
        //}
    }
}

scrape_hltv();
  • What is your question? Commented May 15, 2015 at 13:43
  • Why does it return the error: Fatal error: Call to a member function find() on a non-object? Commented May 15, 2015 at 13:52

1 Answer


There are several situations in which file_get_html() returns false.

See the source code here: http://sourceforge.net/p/simplehtmldom/code/HEAD/tree/trunk/simple_html_dom.php#l79

if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
{
    return false;
}

For your link

http://www.hltv.org/news/14968-world-championships-captains-unveiled

I think it is because the content of the page is larger than MAX_FILE_SIZE (600,000 bytes). The page is actually around 3 MB.
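Regardless of the size limit, the fatal error itself comes from calling ->find() on the false that file_get_html() returned, so the scrape loop should guard the return value before touching it. A minimal, self-contained sketch of that guard (load_page() here is a hypothetical stand-in for file_get_html() that simulates the oversized-page failure, so the sketch runs on its own):

```php
<?php
// Sketch: guard against a failed load before calling any method on the result.
// load_page() is a stand-in for simple_html_dom's file_get_html(): it returns
// false for "oversized" pages, mimicking the failure described above.
function load_page(string $url)
{
    return strlen($url) > 30 ? false : new DOMDocument();
}

$links = [
    'short-url',
    'http://www.hltv.org/news/14968-world-championships-captains-unveiled',
];

foreach ($links as $link) {
    $page = load_page($link);
    if ($page === false) {          // guard before any method call
        echo "skipped: $link\n";
        continue;
    }
    echo "parsed: $link\n";
}
// prints:
// parsed: short-url
// skipped: http://www.hltv.org/news/14968-world-championships-captains-unveiled
```

The same `=== false` check dropped into the question's loop (before the ->find() call) turns the fatal error into a skipped item.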

If you want to process larger files as well, you can try a modified version of the function:

define('DEFAULT_TARGET_CHARSET', 'UTF-8');
define('DEFAULT_BR_TEXT', "\r\n");
define('DEFAULT_SPAN_TEXT', " ");

function file_get_html_modified($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    if (empty($contents))
    {
        return false;
    }
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}

The || strlen($contents) > MAX_FILE_SIZE check was removed.
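Alternatively, PHP's built-in DOMDocument and DOMXPath can run the same XPath query with no file-size cap, avoiding the library limit entirely. A sketch (inline HTML stands in for the fetched page; the rNewsContent class name is taken from the question):

```php
<?php
// Sketch: query //div[@class='rNewsContent'] with PHP's built-in DOM
// extension instead of simple_html_dom. Inline HTML stands in for the page.
$html = '<html><body><div class="rNewsContent">Article body</div></body></html>';

$dom = new DOMDocument();
// @ suppresses the warnings that real-world, imperfect HTML often triggers.
if (@$dom->loadHTML($html) === false) {
    echo "skip: page could not be parsed\n";
} else {
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query("//div[@class='rNewsContent']");
    if ($nodes->length === 0) {
        echo "skip: no rNewsContent div on this page\n";
    } else {
        echo $nodes->item(0)->textContent, "\n"; // prints "Article body"
    }
}
```

Note that DOMXPath requires the node-list length check shown above, since query() succeeds even when nothing matches.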


2 Comments

@PeterPik Do you want to process larger files as well? Or just don't want your code to crash?
Hmm, I just want it to process the file even though it might be bigger.
