
I am trying to create a program that will open a text file with URLs separated by |. It will then take the first URL in the file, crawl it, and remove it from the text file. Each URL is to be scraped by a basic crawler. I know the crawler part works, because if I hard-code one of the URLs in quotes, rather than read it as a variable from the text file, it works. I am at the point where nothing is returned because the URL simply is not accepted.

This is a basic version of my code, because I had to break it down a lot to isolate the problem.

$urlarray = explode("|", $contents = file_get_contents('urls.txt'));

$url = $urlarray[0];
$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($url);

$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $element)
{
    $title = $element->getAttribute('title');
    $class = $element->getAttribute('class');
    if($class == 'result_link')
    {
        $title = str_replace('Synonyms of ', '', $title);
        echo $title . "<br />";
    }
}
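The "take the first URL and remove it from the text file" step described above is not in the snippet; a minimal sketch of it might look like this (popFirstUrl is a hypothetical helper name, and it assumes urls.txt is small enough to read whole):

```php
<?php
// Hypothetical helper: pop the first URL off a pipe-separated list.
// Returns the first URL plus the remaining list, so the caller can
// crawl the URL and write the remainder back to urls.txt.
function popFirstUrl($contents)
{
    // Trim each piece and drop empty entries (e.g. a trailing "|").
    $urls = array_values(array_filter(array_map('trim', explode('|', $contents)), 'strlen'));
    $first = array_shift($urls);
    return array($first, implode('|', $urls));
}

// Usage against the file:
// list($url, $rest) = popFirstUrl(file_get_contents('urls.txt'));
// file_put_contents('urls.txt', $rest); // the crawled URL is now removed
?>
```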
  • Could you show an example URL from urls.txt? Commented Mar 14, 2012 at 22:55
  • 2
  • Asking us questions about loadHTMLFile failing is rude when you put an @ in front of it. Remove the error suppression before asking questions. Commented Mar 14, 2012 at 22:57
  • Sorry about that. I copied the code from elsewhere because I am still learning, and was not aware the @ symbol suppressed errors. So I removed it and learned that there is a " " in front of the "http://". Now my question is: how do I fix this? Commented Mar 15, 2012 at 18:31

2 Answers


The code below works like a champ, tested with your example data:

<?php
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));

$url = $urlarray[0];

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument();
@$dom->loadHTML($html);

$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $element)
{
    $title = $element->getAttribute('title');
    $class = $element->getAttribute('class');
    if($class == 'result_link')
    {
        $title = str_replace('Synonyms of ', '', $title);
        echo $title . "<br />";
    }
}
?>

Almost forgot: let's now put it in a loop to run through all of the URLs:

<?php
    $urlarray = explode("|", $contents = file_get_contents('urls.txt'));

    foreach($urlarray as $url) {
        if(!empty($url)) {
            $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

            $ch = curl_init();
            curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
            curl_setopt($ch, CURLOPT_URL, trim($url));
            curl_setopt($ch, CURLOPT_FAILONERROR, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($ch, CURLOPT_AUTOREFERER, true);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_TIMEOUT, 10);
            $html = curl_exec($ch);
            curl_close($ch);

            $dom = new DOMDocument();
            @$dom->loadHTML($html);

            $anchors = $dom->getElementsByTagName('a');
            foreach($anchors as $element)
            {
                $title = $element->getAttribute('title');
                $class = $element->getAttribute('class');
                if($class == 'result_link')
                {
                    $title = str_replace('Synonyms of ', '', $title);
                    echo $title . "<br />";
                }
            }
            echo '<hr />';
        }
    }
?>



So if you put in a URL manually ($url = 'http://www.mywebsite.com';), everything works as expected?

If so, there is a problem here: $urlarray = explode("|", $contents = file_get_contents('urls.txt'));

Are you sure urls.txt is loading? Are you sure it contains http://a.com|http://b.com etc.?

I would var_dump $contents = file_get_contents('urls.txt') before the explode statement to see if it is loading in.

If yes, then I would explode it into $urlarray and var_dump $urlarray[0].

If it looks right, I would trim it before it is sent to the DOM, with trim($urlarray[0]).
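That trimming step might look like this (a sketch on example data; the stray leading spaces mimic the " " before "http://" mentioned in the comments):

```php
<?php
// Trim every entry up front so no stray whitespace reaches loadHTMLFile().
// array_map('trim', ...) applies trim() to each exploded piece.
$contents = " http://a.com| http://b.com"; // example data with leading spaces
$urlarray = array_map('trim', explode('|', $contents));
$url = $urlarray[0]; // "http://a.com", now safe for DOMDocument::loadHTMLFile()
?>
```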

I may even go as far as validating with a regex to make sure these URLs are in fact URLs before sending them to the DOM.
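A sketch of that validation step using PHP's built-in filter_var() instead of a hand-rolled regex (validUrls is a hypothetical helper name):

```php
<?php
// Hypothetical helper: keep only entries that are syntactically valid URLs.
// Note filter_var() rejects scheme-less entries such as "thesaurus.com/list/...",
// so it would also flag the malformed lines quoted in the comments below.
function validUrls($candidates)
{
    $trimmed = array_map('trim', $candidates);
    return array_values(array_filter($trimmed, function ($u) {
        return filter_var($u, FILTER_VALIDATE_URL) !== false;
    }));
}

// Usage:
// $urls = validUrls(explode('|', file_get_contents('urls.txt')));
?>
```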

Let me know the results and I will try to help further, or post all of your sample code (including urls.txt) and I will run it locally.

6 Comments

Technically this is a comment and not an answer, as it only suggests how to debug (but it doesn't suggest removing the error suppression operator, which would likely reveal the problem and allow us to answer the question).
Agreed! However, I'm trying to teach a man to fish rather than give him the fish, lol. Should I remove it? Not trying to get downvoted.
Up to you. I won't downvote it, but IMO it should be a comment.
For the record, I am able to echo out $url and it prints fine. As for the text file I have, here are some of the lines: thesaurus.com/list/a/a+1/1|http://thesaurus.com/list/a/…
OK, as expected, I ran your code with the @ symbol removed and I get this error: Warning: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: URL file-access is disabled in the server configuration in
