
I am "attempting" to scrape a web page that has the following structures within the page:

<p class="row">
    <span>stuff here</span>
    <a href="http://www.host.tld/file.html">Descriptive Link Text</a>
    <div>Link Description Here</div>
</p>

I am scraping the webpage using curl:

<?php
    $handle = curl_init();
    curl_setopt($handle, CURLOPT_URL, "http://www.host.tld/");
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($handle);
    curl_close($handle);
?>

I have done some research and found that I should not use a regex to parse the HTML returned by curl, and that I should use PHP's DOM extension instead. This is how I have done it:

$newDom = new DOMDocument();
$newDom->preserveWhiteSpace = false; // must be set before loadHTML() to take effect
$newDom->loadHTML($html);
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for ($i = 0; $i < $nodeNo; $i++) {
    $printString = $sections->item($i)->nodeValue;
    echo $printString . "<br>";
}

Now, I am not pretending that I completely understand this, but I get the gist, and I do get the sections I want. The only issue is that what I get is only the text of the HTML page, as if I had copied it out of my browser window. What I want is the actual HTML, because I want to extract the links and use them too, like so:

for($i=0; $i<$nodeNo; $i++){
    $printString = $sections->item($i)->nodeValue;
    echo "<a href=\"<extracted link>\">LINK</a> " . $printString . "<br>";
}

As you can see, I cannot get the link, because I am only getting the text of the web page and not the source. I know that curl_exec() is pulling the HTML, because I have tested that on its own, so I believe the DOM step is somehow stripping the HTML I want.

3 Answers


According to comments on the PHP manual's DOM documentation, you should use the following inside your loop:

    $tmp_dom = new DOMDocument();
    $tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
    $innerHTML = trim($tmp_dom->saveHTML()); 

This will set $innerHTML to the HTML markup of the node. (Strictly speaking it includes the node's own tag, so it is really the node's outer HTML.)
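
In context, a minimal sketch of the whole loop might look like this (assuming $html already holds the page source fetched with curl, as in the question):

$newDom = new DOMDocument();
$newDom->loadHTML($html);

$sections = $newDom->getElementsByTagName('p');
for ($i = 0; $i < $sections->length; $i++) {
    // Copy the node (and its children) into a throwaway document,
    // then serialize that document to get the node's markup.
    $tmp_dom = new DOMDocument();
    $tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
    $innerHTML = trim($tmp_dom->saveHTML());
    echo htmlspecialchars($innerHTML) . "<br>";
}

(The htmlspecialchars() call is only there so the markup is displayed rather than rendered by the browser; drop it if you want the raw HTML.)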

But I think what you really want is to get the <a> nodes under each <p> node, so do this:

$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for($i=0; $i<$nodeNo; $i++) {
    $sec = $sections->item($i);
    $links = $sec->getElementsByTagName('a');
    $linkNo = $links->length;
    for ($j=0; $j<$linkNo; $j++) {
        $printString = $links->item($j)->nodeValue;
        echo $printString . "<br>";
    }
}

This will just print the text of each link.
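
Since you want the links themselves and not just their text, you can also read each anchor's href attribute with DOMElement::getAttribute(). A sketch, using the same loop and the output format from your question:

for ($i = 0; $i < $nodeNo; $i++) {
    $sec = $sections->item($i);
    $links = $sec->getElementsByTagName('a');
    for ($j = 0; $j < $links->length; $j++) {
        $link = $links->item($j);
        $href = $link->getAttribute('href'); // the URL from the href attribute
        $text = $link->nodeValue;            // the link's visible text
        echo '<a href="' . htmlspecialchars($href) . '">LINK</a> ' . $text . "<br>";
    }
}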


1 Comment

You can also iterate over the nodes with foreach instead of the for loops. That makes the code more compact and easier to read, since you do not actually seem to need any of the indices.
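
For example, the loop from this answer could be written as (a sketch with the same behavior):

foreach ($newDom->getElementsByTagName('p') as $sec) {
    foreach ($sec->getElementsByTagName('a') as $link) {
        echo $link->nodeValue . "<br>";
    }
}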

You can pass a node to DOMDocument::saveXML(). Try this:

$printString = $newDom->saveXML($sections->item($i));
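
In context, that looks like this (a sketch reusing the loop from the question; htmlspecialchars() is just so the markup shows up in the browser instead of being rendered):

$sections = $newDom->getElementsByTagName('p');
for ($i = 0; $i < $sections->length; $i++) {
    // Passing a node to saveXML() serializes just that node,
    // i.e. its outer HTML/XML rather than its text content.
    $printString = $newDom->saveXML($sections->item($i));
    echo htmlspecialchars($printString) . "<br>";
}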

2 Comments

Yes, this will effectively return the outerHTML of the node.
Apparently, the poster wanted the inner HTML, not the outer. That was not clear to me, but I will leave my answer up for the saveXML reference, anyway.

You might want to take a look at phpQuery for doing server-side HTML parsing like this; the project includes a basic example.
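
A minimal sketch of what that could look like (assuming phpQuery is installed; the include path and the p.row a selector are illustrative, not taken from the question):

require_once 'phpQuery/phpQuery.php'; // adjust to wherever phpQuery lives

// Load the HTML fetched with curl into a phpQuery document.
phpQuery::newDocument($html);

// jQuery-style selector: every <a> inside a <p class="row">.
foreach (pq('p.row a') as $a) {
    // pq() wraps plain DOM nodes, so attr() and text() work jQuery-style.
    echo '<a href="' . pq($a)->attr('href') . '">LINK</a> ' . pq($a)->text() . "<br>";
}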
