I'm working on a PHP scraper to do the following:
1. cURL several (always fewer than 10) URLs,
2. Add the HTML from each URL to a DOMDocument,
3. Parse that DOMDocument for <a> elements which link to PDFs,
4. Store the hrefs for matching elements in an array.
I have steps 1 & 2 down (my code outputs the combined HTML for all URLs), but when I try to iterate through the result to find `<a>` elements linking to PDFs, I get nothing (an empty array).
I've tried my parser code on a single cURL and it works (returns an array with the URLs for each pdf on that page).
Here's my cURL code:
$urls = Array(
    'http://www.example.com/about/1.htm',
    'http://www.example.com/about/2.htm',
    'http://www.example.com/about/3.htm',
    'http://www.example.com/about/4.htm'
);
# Make DOMDoc
$dom = new DOMDocument();
foreach ($urls as $url) {
    $ch = curl_init($url);
    $html = curl_exec($ch);
    # Exec and close CURL, suppressing errors
    @$dom->createDocumentFragment($html);
    curl_close($ch);
}
And the parser code:
# Make pdf link array
$pdf_array = array();
# Iterate over all <a> tags and spit out those that end with ".pdf"
foreach ($dom->getElementsByTagName('a') as $link) {
    # Show the <a href>
    $linkh = $link->getAttribute('href');
    $filend = ".pdf";
    # @ at beginning suppresses string length warning
    @$pdftester = substr_compare($linkh, $filend, -4, 4, true);
    if ($pdftester === 0) {
        array_push($pdf_array, $linkh);
    }
}
The full code looks like this:
<?php
$urls = Array(
    'http://www.example.com/about/1.htm',
    'http://www.example.com/about/2.htm',
    'http://www.example.com/about/3.htm',
    'http://www.example.com/about/4.htm'
);
# Make DOM parser
$dom = new DOMDocument();
foreach ($urls as $url) {
    $ch = curl_init($url);
    $html = curl_exec($ch);
    # Exec and close CURL, suppressing errors
    @$dom->createDocumentFragment($html);
    curl_close($ch);
}
# Make pdf link array
$pdf_array = array();
# Iterate over all <a> tags and spit out those that end with ".pdf"
foreach ($dom->getElementsByTagName('a') as $link) {
    # Show the <a href>
    $linkh = $link->getAttribute('href');
    $filend = ".pdf";
    # @ at beginning suppresses string length warning
    @$pdftester = substr_compare($linkh, $filend, -4, 4, true);
    if ($pdftester === 0) {
        array_push($pdf_array, $linkh);
    }
}
print_r($pdf_array);
?>
Any suggestions for what I'm doing wrong on the DOM parsing and PDF array building?
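
For comparison, here is a minimal sketch of one way the four steps could fit together. It is not a confirmed fix. It rests on two facts about the calls used above: `curl_exec()` prints the response and returns `true` unless `CURLOPT_RETURNTRANSFER` is set, and `DOMDocument::createDocumentFragment()` takes no arguments and returns an empty fragment that is never attached to the document. The sketch parses each page in its own `DOMDocument` via `loadHTML()` instead of merging everything into one document, which is a swapped-in approach. The helper names `fetch_html`, `extract_pdf_links`, and `collect_pdf_links` are hypothetical.

```php
<?php
// Hypothetical helper: fetch one page and return its HTML, or null on failure.
function fetch_html(string $url): ?string
{
    $ch = curl_init($url);
    // Ask cURL to return the body as a string instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    return is_string($html) ? $html : null;
}

// Hypothetical helper: return every href in $html that ends in ".pdf"
// (case-insensitive).
function extract_pdf_links(string $html): array
{
    $links = [];
    if (trim($html) === '') {
        return $links; // nothing to parse
    }

    $dom = new DOMDocument();
    // Collect libxml warnings about malformed markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        // The length check avoids comparing against hrefs shorter than ".pdf".
        if (strlen($href) >= 4 && substr_compare($href, '.pdf', -4, 4, true) === 0) {
            $links[] = $href;
        }
    }
    return $links;
}

// Hypothetical helper: run both steps over a list of URLs and merge the results.
function collect_pdf_links(array $urls): array
{
    $all = [];
    foreach ($urls as $url) {
        $html = fetch_html($url);
        if ($html !== null) {
            $all = array_merge($all, extract_pdf_links($html));
        }
    }
    return $all;
}
```

Under these assumptions, `print_r(collect_pdf_links($urls));` with the `$urls` array above would print one flat array of matching hrefs across all pages.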