I'd like to scrape the actual, dynamically created URLs in this web page's menu using PHP:
http://groceries.iceland.co.uk/
I have previously used something like this:
<?php
$baseurls = array("http://groceries.iceland.co.uk/");
foreach ($baseurls as $source)
{
    $html = file_get_contents($source);
    // cut out the main navigation block
    $start = strpos($html, '<nav id="mainNavigation"');
    $end = strpos($html, '</nav>', $start);
    $mainarea = substr($html, $start, $end - $start);
    $dom = new DOMDocument();
    @$dom->loadHTML($mainarea);
    // grab all the urls in the navigation
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");
    for ($i = 0; $i < $hrefs->length; $i++)
    {
        $href = $hrefs->item($i);
        $url = $href->getAttribute('href');
        echo $url, "\n";
    }
}
?>
but it's not doing the job for this particular page. For example, my code returns a URL such as:
groceries.iceland.co.uk//frozen-chips-and-potato-products
but I want it to give me:
groceries.iceland.co.uk//frozen/chips-and-potato-products/c/FRZCAP?q=:relevance&view=list
The browser adds "/c/FRZCAP?q=:relevance&view=list" to the end and this is what I want.
Hope you can help. Thanks.
substr() on HTML (I don't think there will be any major performance trade-off in loading the entire HTML into the DOMDocument). As for the issue itself: if the extra data is appended by JavaScript after the page loads, PHP will never be able to see it, because file_get_contents() only fetches the HTML the server sends and does not run any scripts. You will need to try a JavaScript-capable headless browser such as PhantomJS, SlimerJS, Zombie.js, etc.
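To illustrate the first point, here is a rough sketch of letting DOMDocument parse the whole page and picking out the menu links with XPath instead of slicing the HTML with substr(). The function name extractNavLinks and the sample markup are made up for the example; in your script the string would come from file_get_contents($source). Note that this still only returns the hrefs present in the HTML the server sends, so it does not solve the JavaScript part.

```php
<?php
// Hypothetical helper: returns every href found inside <nav id="mainNavigation">.
function extractNavLinks(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    // Restrict the search to anchors inside the navigation element.
    foreach ($xpath->query('//nav[@id="mainNavigation"]//a[@href]') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

// Made-up sample markup standing in for the downloaded page.
$sample = '<html><body><a href="/outside">x</a>'
    . '<nav id="mainNavigation"><ul>'
    . '<li><a href="/frozen">Frozen</a></li>'
    . '<li><a href="/fresh">Fresh</a></li>'
    . '</ul></nav></body></html>';

print_r(extractNavLinks($sample));
```

Doing the filtering in the XPath expression means the code no longer depends on the exact position of the opening and closing nav tags in the raw string.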