
I'd like to scrape the dynamically created URLs in this web page's menu using PHP:

http://groceries.iceland.co.uk/

I have previously used something like this:

<?php
$baseurls = array("http://groceries.iceland.co.uk/");

foreach ($baseurls as $source)
{
    $html = file_get_contents($source);

    // Cut the markup down to just the main navigation block
    $start = strpos($html, '<nav id="mainNavigation"');
    $end   = strpos($html, '</nav>', $start);
    $mainarea = substr($html, $start, $end - $start);

    $dom = new DOMDocument();
    @$dom->loadHTML($mainarea);

    // Grab all the URLs in that block
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->query('//a');

    for ($i = 0; $i < $hrefs->length; $i++)
    {
        $href = $hrefs->item($i);
        $url  = $href->getAttribute('href');
        echo $url, "\n";
    }
}
?>

but it's not doing the job for this particular page. For example, my code returns a URL such as:

groceries.iceland.co.uk//frozen-chips-and-potato-products

but I want it to give me: groceries.iceland.co.uk//frozen/chips-and-potato-products/c/FRZCAP?q=:relevance&view=list

The browser adds "/c/FRZCAP?q=:relevance&view=list" to the end and this is what I want.

Hope you can help. Thanks!

  • One quick note: I think you are better off not using substr() on HTML (I don't think there will be any major performance trade-off in loading the entire HTML into the DOMDocument). As for the issue, if the extra data is appended by JS after page load, PHP will never be able to see it. You will need to try a JS-based headless browser like PhantomJS, SlimerJS, Zombie.js, etc. Commented Jan 20, 2014 at 21:08
  • The first step with questions like these is to turn off JavaScript in your browser, refresh the page you want, and see if the data is still there. If it is, you can probably do what you want with a cURL-based library (Goutte is excellent, based on Guzzle). If it is not, then you'll need a (slower) headless browser (see @Sam's comment). Commented Jan 20, 2014 at 21:30
  • Watch the network tab of Firebug/Chrome inspector. The info they add to your page comes from somewhere; often that's a remote REST API that runs after the page loads. Sometimes the API has everything you want in a nice clean package and you don't even need to scrape... Commented Jan 20, 2014 at 21:32
  • Do you absolutely need to use PHP? When I do this I tend to write some jQuery in Firebug, run it manually, and get all the links after the page is loaded. If you're trying to do this as a once-off (not on a schedule), let me know and I'll show you how to do it with jQuery. Commented Jan 21, 2014 at 0:25
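As the comments above suggest, one option is to find the JSON endpoint the page calls after load (watch the network tab) and request it directly from PHP. A minimal sketch; the endpoint URL and response shape below are placeholders you would replace with whatever your own network-tab inspection reveals:

```php
<?php
// Hypothetical endpoint: replace with the actual request you see
// in the browser's network tab after the page finishes loading.
$apiUrl = "http://groceries.iceland.co.uk/example/menu/endpoint";

$ch = curl_init($apiUrl);
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
    CURLOPT_FOLLOWLOCATION => true,   // follow any redirects
    CURLOPT_USERAGENT      => "Mozilla/5.0 (compatible; MyScraper/1.0)",
));
$json = curl_exec($ch);
curl_close($ch);

$data = json_decode($json, true);
if ($data !== null) {
    // Inspect the decoded structure to locate the menu URLs;
    // the exact keys depend on the API's response format.
    print_r($data);
}
?>
```

If the API returns the category links directly, this avoids HTML parsing (and a headless browser) entirely.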

1 Answer


Edit: Just to confirm, I had a look at the website you're trying to scrape with JavaScript turned off, and it appears that the main-nav URLs are generated using JavaScript, so you will be unable to scrape the page without a headless browser.

Per @Sam and @halfer's comments, if you need to scrape a site that has dynamic URLs generated by JavaScript then you will need to use a scraper that supports JavaScript.

If you want to do the bulk of your development in PHP, then I recommend not trying to use a headless browser via PHP and instead relying on a service that can scrape a JavaScript rendered page and return the contents for you.

The best one that I've found, and one that we use in our projects, is https://phantomjscloud.com/
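A rough sketch of that approach: POST the target URL to the rendering service and parse the returned, JavaScript-executed markup with the same DOMDocument/DOMXPath technique as before. The endpoint shape here follows PhantomJsCloud's v2 API as I understand it; treat the details as an assumption and check their current documentation:

```php
<?php
// Assumed endpoint/payload format -- verify against PhantomJsCloud's docs.
$apiKey  = "your-api-key-here";
$payload = json_encode(array(
    "url"        => "http://groceries.iceland.co.uk/",
    "renderType" => "html",   // ask for the DOM after JavaScript has run
));

$ch = curl_init("https://phantomjscloud.com/api/browser/v2/{$apiKey}/");
curl_setopt_array($ch, array(
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $payload,
    CURLOPT_HTTPHEADER     => array("Content-Type: application/json"),
    CURLOPT_RETURNTRANSFER => true,
));
$html = curl_exec($ch);
curl_close($ch);

// The nav links now exist in the markup, so the original
// DOMDocument/DOMXPath approach works unchanged:
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//nav[@id="mainNavigation"]//a') as $a) {
    echo $a->getAttribute('href'), "\n";
}
?>
```

The XPath here targets the nav element directly, so there is no need for the substr() trimming from the question.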
