I know how to get the HTML source code via cURL, but I want to remove the comments from the HTML document (that is, everything between <!-- and -->). In addition, I would like to extract just the BODY of the HTML document. Thank you.
You would have to parse it manually. I have a JavaScript library of my own for that, but I don't know how you could implement it in PHP. – metaforce, Jun 10, 2011 at 11:24
Isn't there a cURL option for this? – Luis, Jun 10, 2011 at 11:26
6 Answers
Try PHP DOM:
$html = '<html><body><!--a comment--><div>some content</div></body></html>'; // put your cURL result here
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Remove every comment node found anywhere in the document
foreach ($xpath->query('//comment()') as $comment) {
    $comment->parentNode->removeChild($comment);
}
// Serialize only the <body> element
$body = $xpath->query('//body')->item(0);
$newHtml = $body instanceof DOMNode ? $dom->saveXML($body) : 'something failed';
var_dump($newHtml);
Output:
string(36) "<body><div>some content</div></body>"
2 Comments
Luis
It works well. I had never heard of DOM. Thank you.
vee
To make multi-line HTML input work without showing an escaped character entity for each newline, change saveXML() to saveHTML(). To include the <html> element in the result, change loadHTML($html) to loadHTML($html, LIBXML_HTML_NODEFDTD) and change the $newHtml line to: $newHtml = $body instanceof DOMNode ? $dom->saveHTML() : 'something failed';
If there's no option for this in cURL (and I suspect there isn't, but I've been wrong before), then you can at the very least parse the resulting HTML to your heart's content with a PHP DOM parser.
This will likely be your best bet in the long run in terms of configurability and support.
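As a rough illustration of the DOM-parser route, here is a minimal sketch that wraps the idea in one function. The name bodyWithoutComments() is made up for this example and is not part of PHP; the sample $html string is also just a placeholder for whatever cURL returned. It assumes the dom and libxml extensions are enabled, and it uses saveHTML() rather than saveXML(), following the comment on the first answer.

```php
<?php
// Hypothetical helper (the name is an assumption, not a PHP built-in):
// returns the <body> of an HTML string with all comments removed,
// or null if the document could not be parsed or has no <body>.
function bodyWithoutComments(string $html): ?string
{
    $dom = new DOMDocument();

    // Pages fetched from the web are often not valid HTML. Buffer libxml's
    // parse warnings internally instead of emitting them as PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($dom);

    // Copy the matches into a plain array first, so removing nodes
    // cannot interfere with the iteration.
    foreach (iterator_to_array($xpath->query('//comment()')) as $comment) {
        $comment->parentNode->removeChild($comment);
    }

    $body = $xpath->query('//body')->item(0);

    // Passing a node to saveHTML() serializes only that node, as HTML.
    return $body instanceof DOMNode ? $dom->saveHTML($body) : null;
}

// Placeholder input; in practice this would be the cURL response body.
$html = "<html><body><!--a comment-->\n<div>some content</div>\n</body></html>";
$result = bodyWithoutComments($html);
var_dump($result);
```

Returning null on failure is only one possible design; throwing an exception would work just as well if the caller prefers that.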
1 Comment
Daniel Stenberg
Correct, there's no such option in curl. It just gets the data as the server sends it.
I've run into issues modifying a DOMNodeList in a foreach loop, which went away when I iterated backwards through the list. For that reason, I would not recommend a foreach loop as in the accepted answer. Instead, use a for loop like this:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
for ($els = $xpath->query('//comment()'), $i = $els->length - 1; $i >= 0; $i--) {
    $els->item($i)->parentNode->removeChild($els->item($i));
}
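To try the backward loop on something slightly less trivial, the following sketch runs it against a made-up document with comments at several nesting levels, then re-runs the XPath query to check that nothing is left. The sample markup and the variable names ($removed, $remaining, $bodyHtml) are assumptions for illustration only.

```php
<?php
// Made-up sample input with comments at different nesting levels.
$html = '<html><body><!-- first --><div><!-- nested --><p>text</p></div><!-- last --></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$comments = $xpath->query('//comment()');
$removed = 0;

// Walk from the last match to the first, so a removal can never
// shift the position of a node that has not been visited yet.
for ($i = $comments->length - 1; $i >= 0; $i--) {
    $node = $comments->item($i);
    $node->parentNode->removeChild($node);
    $removed++;
}

// Query again to confirm that no comment nodes remain in the document.
$remaining = $xpath->query('//comment()')->length;

// Serialize only the <body> element.
$bodyHtml = $dom->saveHTML($dom->getElementsByTagName('body')->item(0));

printf("removed %d, remaining %d\n", $removed, $remaining);
echo $bodyHtml, "\n";
```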