
I am trying to create a program that will open a text file with URLs separated by |. It will then take the first URL in the file, crawl it, and remove it from the text file. Each URL is to be scraped by a basic crawler. I know the crawler part works, because if I hard-code one of the URLs in quotes, rather than read it as a variable from the text file, it works. I am at the point where nothing is returned because the URL simply is not accepted.

This is a basic version of my code, because I had to break it down a lot to isolate the problem.

$urlarray = explode("|", $contents = file_get_contents('urls.txt'));

$url = $urlarray[0];
$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($url);

$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $element)
{
    $title = $element->getAttribute('title');
    $class = $element->getAttribute('class');
    if($class == 'result_link')
    {
        $title = str_replace('Synonyms of ', '', $title);
        echo $title . "<br />";
    }
}
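The "take the first URL and remove it from the text file" step described above is not in the snippet; a minimal sketch of it might look like this (popFirstUrl is a hypothetical helper name, and it assumes urls.txt is small enough to read whole):

```php
<?php
// Hypothetical helper: pop the first URL off a pipe-separated list.
// Returns the first URL plus the remaining list, so the caller can
// crawl the URL and write the remainder back to urls.txt.
function popFirstUrl($contents)
{
    // Trim each piece and drop empty entries (e.g. a trailing "|").
    $urls = array_values(array_filter(array_map('trim', explode('|', $contents)), 'strlen'));
    $first = array_shift($urls);
    return array($first, implode('|', $urls));
}

// Usage against the file:
// list($url, $rest) = popFirstUrl(file_get_contents('urls.txt'));
// file_put_contents('urls.txt', $rest); // the crawled URL is now removed
?>
```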
  • Could you show an example URL from urls.txt? Commented Mar 14, 2012 at 22:55
  • 2
  • Asking us questions about loadHTMLFile failing is rude when you put an @ in front of it. Remove the error suppression before asking questions. Commented Mar 14, 2012 at 22:57
  • Sorry about that. I copied the code from elsewhere because I am still learning, and was not aware the @ symbol suppressed errors. So I removed it and learned that there is a " " in front of the "http://". Now my question is: how do I fix this? Commented Mar 15, 2012 at 18:31

2 Answers


The code below works like a champ, tested with your example data:

<?php
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));

$url = $urlarray[0];

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument();
@$dom->loadHTML($html);

$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $element)
{
    $title = $element->getAttribute('title');
    $class = $element->getAttribute('class');
    if($class == 'result_link')
    {
        $title = str_replace('Synonyms of ', '', $title);
        echo $title . "<br />";
    }
}
?>

Almost forgot: let's now put it in a loop to run through all of the URLs:

<?php
    $urlarray = explode("|", $contents = file_get_contents('urls.txt'));

    foreach($urlarray as $url) {
        if(!empty($url)) {
            $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

            $ch = curl_init();
            curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
            curl_setopt($ch, CURLOPT_URL, trim($url));
            curl_setopt($ch, CURLOPT_FAILONERROR, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($ch, CURLOPT_AUTOREFERER, true);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_TIMEOUT, 10);
            $html = curl_exec($ch);
            curl_close($ch);

            $dom = new DOMDocument();
            @$dom->loadHTML($html);

            $anchors = $dom->getElementsByTagName('a');
            foreach($anchors as $element)
            {
                $title = $element->getAttribute('title');
                $class = $element->getAttribute('class');
                if($class == 'result_link')
                {
                    $title = str_replace('Synonyms of ', '', $title);
                    echo $title . "<br />";
                }
            }
            echo '<hr />';
        }
    }
?>



So if you put in a URL manually ($url = 'http://www.mywebsite.com';), everything works as expected?

If so, there is a problem here: $urlarray = explode("|", $contents = file_get_contents('urls.txt'));

Are you sure urls.txt is loading? Are you sure it contains http://a.com|http://b.com etc.?

I would var_dump $contents = file_get_contents('urls.txt') before the explode statement to see if it is loading in.

If yes, then I would explode it into $urlarray and var_dump $urlarray[0].

If it looks right, I would trim it before it is sent to the DOM, with trim($urlarray[0]).
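That trimming step might look like this (a sketch on example data; the stray leading spaces mimic the " " before "http://" mentioned in the comments):

```php
<?php
// Trim every entry up front so no stray whitespace reaches loadHTMLFile().
// array_map('trim', ...) applies trim() to each exploded piece.
$contents = " http://a.com| http://b.com"; // example data with leading spaces
$urlarray = array_map('trim', explode('|', $contents));
$url = $urlarray[0]; // "http://a.com", now safe for DOMDocument::loadHTMLFile()
?>
```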

I may even go as far as validating with a regex to make sure these URLs are in fact URLs before sending them to the DOM.
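A sketch of that validation step using PHP's built-in filter_var() instead of a hand-rolled regex (validUrls is a hypothetical helper name):

```php
<?php
// Hypothetical helper: keep only entries that are syntactically valid URLs.
// Note filter_var() rejects scheme-less entries such as "thesaurus.com/list/...",
// so it would also flag the malformed lines quoted in the comments below.
function validUrls($candidates)
{
    $trimmed = array_map('trim', $candidates);
    return array_values(array_filter($trimmed, function ($u) {
        return filter_var($u, FILTER_VALIDATE_URL) !== false;
    }));
}

// Usage:
// $urls = validUrls(explode('|', file_get_contents('urls.txt')));
?>
```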

Let me know the results and I will try to help further, or post all of your sample code (including urls.txt) and I will run it locally.

6 Comments

Technically this is a comment and not an answer, as it only suggests how to debug (but it doesn't suggest removing the error suppression operator, which would likely reveal the problem and allow us to answer the question).
Agreed! However, I'm trying to teach a man to fish rather than give him the fish, lol. Should I remove it? Not trying to get downvoted.
Up to you. I won't downvote it, but IMO it should be a comment.
For the record, I am able to echo out $url and it prints fine. As for the text file I have, here are some of the lines: thesaurus.com/list/a/a+1/1|http://thesaurus.com/list/a/…
OK, as expected, I ran your code with the @ symbol removed and I get this error: Warning: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: URL file-access is disabled in the server configuration in
