I'm trying to create a small URL crawler for internal use within the company I work for.

Currently, I have a helper class where all the magic happens and an index.php that displays the results.

What I'd like is to supply a URL and have the code fetch every page URL that the site contains and display them on screen.

However, waiting for the foreach loop to finish takes an age, so I'd like to echo each link as soon as its iteration of the loop runs.

I can't get it to work. I don't know whether the problem is the link-fetching code or my attempts to flush the output buffer. I've followed the examples in this question: Echo 'string' while every long loop iteration (flush() not working)

My code is below (updated to include one of my flushing attempts):

// INDEX.PHP

require_once('helper.php');

$helper = new Helper();

flush();
ob_flush();

$found = $helper->crawlSite('http://www.bbc.co.uk', 'http://www.bbc.uk');

echo count($found);


// HELPER.PHP

class Helper
{
    private $checked = [];
    private $foundUrls = [];

    public function __construct()
    {

    }

    public function getHTML($url)
    {
        $curl = curl_init($url);

        curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
        $html = curl_exec($curl);
        curl_close($curl);

        return $html;
    }

    public function getTagFromHTML($html, $tag)
    {
        $dom = new DOMDocument();
        $dom->loadHTML($html);

        return $dom->getElementsByTagName($tag);
    }

    function crawlSite($url, $initialUrl)
    {
        $html = $this->getHTML($url);
        $links = $this->getTagFromHTML($html, 'a');

        foreach ($links as $link) {
            echo $link->getAttribute('href') . '<br>';

            flush();
            ob_flush();

            if (!in_array($link->getAttribute('href'), $this->checked)) {
                if (strpos($link->getAttribute('href'), $initialUrl) !== FALSE) {
                    $this->foundUrls[] = $link->getAttribute('href');
                    $this->crawlSite($link->getAttribute('href'), $initialUrl);
                } else {
                    $this->foundUrls[] = $initialUrl . $link->getAttribute('href');
                    $this->crawlSite($initialUrl . $link->getAttribute('href'), $initialUrl);
                }

                $this->checked[] = $link->getAttribute('href');
            }else{
                echo "Already Checked <br>";

                flush();
                ob_flush();
            }
        }


        return $this->foundUrls;
    }
}

Update

Updated the code to use a larger site to demonstrate the problem. I also included one of my attempts at flushing the output buffer and implemented @Dev Jyoti Behera's suggestion of moving the echo.

Update 2

Thanks to the suggestion mentioned above, I can now see live text being printed on the screen. However, I now have a second problem: the crawler seems to be ignoring the "already checked" if statement, so it checks and lists the same URL over and over. /sigh - I love programming, honestly.
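
One possible explanation, offered as a sketch rather than a confirmed diagnosis: in crawlSite() above, $this->checked[] is only appended after the recursive crawlSite() call returns, so any page reached during that recursion still looks unchecked. The example below records a URL before recursing. The function crawl(), the $fetchLinks callback (standing in for getHTML() plus getTagFromHTML()) and the in-memory $site array are hypothetical names used only for illustration.

```php
<?php
// Hypothetical sketch: $fetchLinks returns the href values found on a page.
function crawl(string $url, callable $fetchLinks, array &$visited): void
{
    if (isset($visited[$url])) {
        return; // already crawled, or currently being crawled higher up the stack
    }

    // Record the URL BEFORE recursing, so a page that links back to
    // itself or to one of its ancestors is not fetched again.
    $visited[$url] = true;

    foreach ($fetchLinks($url) as $href) {
        crawl($href, $fetchLinks, $visited);
    }
}

// Usage with a tiny in-memory "site" whose pages link to each other in a cycle.
$site = [
    '/a' => ['/b', '/a'],
    '/b' => ['/a', '/c'],
    '/c' => [],
];

$visited = [];
crawl('/a', function (string $url) use ($site): array {
    return $site[$url] ?? [];
}, $visited);

$order = array_keys($visited); // URLs in the order they were first crawled
```

Using the URL as an array key also makes the lookup cheaper than in_array() on a growing list. Note that the original code stores the raw href in $checked but crawls $initialUrl . href in the else branch, so the value checked and the value crawled may need to be normalised to the same form for the comparison to match.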

  • The updated code contains more opening brackets than closing ones. It should echo just fine once the brackets are balanced. You could also place the echo directly after the foreach ($links as $link) { line, as echo $link->getAttribute('href') . '<br>';, but I don't think that should make any difference. Commented Apr 8, 2016 at 15:47
  • Does the echo count($found); line return a value besides 0? Commented Apr 8, 2016 at 15:48
  • Thanks Qirel. @EatPeanutButter On a small site, like the one mentioned in the question, it returns 4 (which is correct in this case). On a larger site, however, I can't tell as I just get a spinner. Commented Apr 8, 2016 at 15:55
  • I see that $link->getAttribute('href') does not change inside the loop. Would it work for you to move the echo $link->getAttribute('href') line to the start of the loop body, before the if-else statement? That way, you will be able to see the link that is currently being crawled. As the code is currently written, the link is printed only after all the crawling is over (which can take a very long time). Commented Apr 8, 2016 at 15:55
  • @Lewis: It sure does. But consider adding it as the first line of the foreach loop's body. That way, a new value will be printed for every iteration and every $link. Commented Apr 8, 2016 at 15:59

1 Answer

Have you tried using ob_flush()? Here is an example. Maybe this helps: https://gist.github.com/jtallant/3260398
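
A minimal sketch of per-iteration flushing, written for illustration and not taken from the linked gist. The helper name emitLine() and the sample hrefs are hypothetical. It flushes PHP's own output buffer first and then calls flush(), which is the reverse of the flush(); ob_flush(); order used in the question. Buffering outside PHP (for example compression or a proxy in front of the web server) may still delay output, so treat this as a starting point only.

```php
<?php
// Hypothetical helper: print one line and push it towards the client immediately.
function emitLine(string $text): void
{
    echo htmlspecialchars($text), "<br>\n";

    // Flush PHP's topmost output buffer first, if one is active...
    if (ob_get_level() > 0) {
        ob_flush();
    }
    // ...then ask the layer below PHP (SAPI / web server) to send what it has.
    flush();
}

// Usage: emit each link as soon as it is known, instead of after the loop ends.
foreach (['/news', '/sport', '/weather'] as $href) {
    emitLine($href);
}
```

Checking ob_get_level() avoids the notice that ob_flush() raises when no output buffer is active.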

1 Comment

Hi, thanks for the reply! Yep, I've tried that (was in the answers in the question I linked)
