
I am "attempting" to scrape a web page that has the following structures within the page:

<p class="row">
    <span>stuff here</span>
    <a href="http://www.host.tld/file.html">Descriptive Link Text</a>
    <div>Link Description Here</div>
</p>

I am scraping the webpage using curl:

<?php
    $handle = curl_init();
    curl_setopt($handle, CURLOPT_URL, "http://www.host.tld/");
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($handle);
    curl_close($handle);
?>

I have done some research and found that I should not use a regex to parse the HTML returned by curl, and that I should use PHP's DOM extension instead. This is how I have done it:

$newDom = new DOMDocument();
$newDom->preserveWhiteSpace = false; // must be set before loadHTML() to take effect
$newDom->loadHTML($html);
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for ($i = 0; $i < $nodeNo; $i++) {
    $printString = $sections->item($i)->nodeValue;
    echo $printString . "<br>";
}

Now, I am not pretending that I completely understand this, but I get the gist, and I do get the sections I want. The only issue is that what I get is only the text of the HTML page, as if I had copied it out of my browser window. What I want is the actual HTML, because I want to extract the links and use them too, like so:

for($i=0; $i<$nodeNo; $i++){
    $printString = $sections->item($i)->nodeValue;
    echo "<a href=\"<extracted link>\">LINK</a> " . $printString . "<br>";
}

As you can see, I cannot get the link, because I am only getting the text of the web page and not the source. I know that curl_exec() is pulling the HTML, because I have tested that on its own, so I believe the DOM step is somehow stripping the HTML I want.

3 Answers


According to comments on the PHP manual's DOM documentation, you should use the following inside your loop:

    $tmp_dom = new DOMDocument();
    $tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
    $innerHTML = trim($tmp_dom->saveHTML()); 

This will set $innerHTML to the HTML markup of the node. (Strictly speaking it includes the node's own tag, so it is really the node's outer HTML.)
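
In context, a minimal sketch of the whole loop might look like this (assuming $html already holds the page source fetched with curl, as in the question):

$newDom = new DOMDocument();
$newDom->loadHTML($html);

$sections = $newDom->getElementsByTagName('p');
for ($i = 0; $i < $sections->length; $i++) {
    // Copy the node (and its children) into a throwaway document,
    // then serialize that document to get the node's markup.
    $tmp_dom = new DOMDocument();
    $tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
    $innerHTML = trim($tmp_dom->saveHTML());
    echo htmlspecialchars($innerHTML) . "<br>";
}

(The htmlspecialchars() call is only there so the markup is displayed rather than rendered by the browser; drop it if you want the raw HTML.)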

But I think what you really want is to get the <a> nodes under each <p> node, so do this:

$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for($i=0; $i<$nodeNo; $i++) {
    $sec = $sections->item($i);
    $links = $sec->getElementsByTagName('a');
    $linkNo = $links->length;
    for ($j=0; $j<$linkNo; $j++) {
        $printString = $links->item($j)->nodeValue;
        echo $printString . "<br>";
    }
}

This will just print the text of each link.
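
Since you want the links themselves and not just their text, you can also read each anchor's href attribute with DOMElement::getAttribute(). A sketch, using the same loop and the output format from your question:

for ($i = 0; $i < $nodeNo; $i++) {
    $sec = $sections->item($i);
    $links = $sec->getElementsByTagName('a');
    for ($j = 0; $j < $links->length; $j++) {
        $link = $links->item($j);
        $href = $link->getAttribute('href'); // the URL from the href attribute
        $text = $link->nodeValue;            // the link's visible text
        echo '<a href="' . htmlspecialchars($href) . '">LINK</a> ' . $text . "<br>";
    }
}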


1 Comment

You can also iterate over the nodes with foreach instead of the for loops. That makes the code more compact and easier to read, since you do not actually seem to need any of the indices.
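
For example, the loop from this answer could be written as (a sketch with the same behavior):

foreach ($newDom->getElementsByTagName('p') as $sec) {
    foreach ($sec->getElementsByTagName('a') as $link) {
        echo $link->nodeValue . "<br>";
    }
}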

You can pass a node to DOMDocument::saveXML(). Try this:

$printString = $newDom->saveXML($sections->item($i));
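
In context, that looks like this (a sketch reusing the loop from the question; htmlspecialchars() is just so the markup shows up in the browser instead of being rendered):

$sections = $newDom->getElementsByTagName('p');
for ($i = 0; $i < $sections->length; $i++) {
    // Passing a node to saveXML() serializes just that node,
    // i.e. its outer HTML/XML rather than its text content.
    $printString = $newDom->saveXML($sections->item($i));
    echo htmlspecialchars($printString) . "<br>";
}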

2 Comments

Yes, this will effectively return the outerHTML of the node.
Apparently, the poster wanted the inner HTML, not the outer. That was not clear to me, but I will leave my answer up for the saveXML reference, anyway.

You might want to take a look at phpQuery for doing server-side HTML parsing like this; the project includes a basic example.
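
A minimal sketch of what that could look like (assuming phpQuery is installed; the include path and the p.row a selector are illustrative, not taken from the question):

require_once 'phpQuery/phpQuery.php'; // adjust to wherever phpQuery lives

// Load the HTML fetched with curl into a phpQuery document.
phpQuery::newDocument($html);

// jQuery-style selector: every <a> inside a <p class="row">.
foreach (pq('p.row a') as $a) {
    // pq() wraps plain DOM nodes, so attr() and text() work jQuery-style.
    echo '<a href="' . pq($a)->attr('href') . '">LINK</a> ' . pq($a)->text() . "<br>";
}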
