12

I know how to get the HTML source code via cURL, but I want to remove the comments from the HTML document (that is, everything between <!-- and -->). In addition, I would like to extract just the BODY of the HTML document, if possible. Thank you.

2
  • You would need to parse it manually... I have a JavaScript library of my own for that, but I don't know how you could implement that in PHP. Commented Jun 10, 2011 at 11:24
  • Isn't there a cURL option for this? Commented Jun 10, 2011 at 11:26

6 Answers

34

Try PHP DOM:

$html = '<html><body><!--a comment--><div>some content</div></body></html>'; // put your cURL result here

$dom = new DOMDocument;
$dom->loadHtml($html);

$xpath = new DOMXPath($dom);
foreach ($xpath->query('//comment()') as $comment) {
    $comment->parentNode->removeChild($comment);
}

$body = $xpath->query('//body')->item(0);
$newHtml = $body instanceof DOMNode ? $dom->saveXml($body) : 'something failed';

var_dump($newHtml);

Output:

string(36) "<body><div>some content</div></body>"

2 Comments

It works well. I had never heard of DOM. Thank you.
To make this work with multi-line HTML without &#13; showing up at line endings, change saveXML() to saveHTML(). To include the <html> element in the result, change loadHTML($html) to loadHTML($html, LIBXML_HTML_NODEFDTD) and change the $newHtml line to $newHtml = $body instanceof DOMNode ? $dom->saveHTML() : 'something failed';.
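
The changes suggested in the comment above can be sketched as one function. This is only an illustration: clean_html and its $wholeDocument flag are hypothetical names, not part of the accepted answer, and the CRLF test string stands in for a real cURL result.

```php
<?php
// Hypothetical helper: the accepted answer's approach with saveHTML() instead of saveXML().
function clean_html(string $html, bool $wholeDocument = false): string
{
    libxml_use_internal_errors(true); // keep warnings about imperfect markup quiet

    $dom = new DOMDocument;
    // LIBXML_HTML_NODEFDTD stops libxml from adding a default doctype.
    $dom->loadHTML($html, LIBXML_HTML_NODEFDTD);

    // Remove every comment node in the document.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//comment()') as $comment) {
        $comment->parentNode->removeChild($comment);
    }

    if ($wholeDocument) {
        // Serialize the whole document, including the <html> element.
        return $dom->saveHTML();
    }

    // Serialize only the <body> element.
    $body = $xpath->query('//body')->item(0);
    return $body instanceof DOMNode ? $dom->saveHTML($body) : 'something failed';
}

// Assumed sample input with Windows line endings.
$html = "<html><body>\r\n<!--a comment-->\r\n<div>some content</div>\r\n</body></html>";
$bodyOnly = clean_html($html);
$whole = clean_html($html, true);
```

The exact whitespace of the serialized output can differ between libxml versions, so it is safer to check for the presence or absence of specific markup than to compare whole strings.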
3

Regex solved this problem for me as follows:

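function remove_html_comments($html = '') {
    // The s modifier lets . match newlines, so multi-line comments are removed
    // without the (.|\s) alternation, which can exhaust PCRE limits on long comments.
    return preg_replace('/<!--.*?-->/s', '', $html);
}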


1

If there's no option for this in cURL (and I suspect there isn't, but I've been wrong before), then you can at the very least parse the resulting HTML to your heart's content with a PHP DOM parser.

This will likely be your best bet in the long run in terms of configurability and support.
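
As a rough sketch of that workflow, the two steps (fetch with cURL, then clean up with DOM) could be kept in separate functions. The names fetch_html and body_without_comments are hypothetical, and the URL in the usage comment is a placeholder.

```php
<?php
// Hypothetical helper: fetch a page with cURL and return it as a string, or null on failure.
function fetch_html(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $result = curl_exec($ch);
    return is_string($result) ? $result : null;
}

// Hypothetical helper: drop all comments and return only the <body> element, or null on failure.
function body_without_comments(string $html): ?string
{
    $previous = libxml_use_internal_errors(true); // collect parser warnings silently
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    // Copy the matches into an array before removing them from the tree.
    $xpath = new DOMXPath($dom);
    foreach (iterator_to_array($xpath->query('//comment()'), false) as $comment) {
        $comment->parentNode->removeChild($comment);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    return $body instanceof DOMNode ? $dom->saveHTML($body) : null;
}

// Usage (placeholder URL, not run here):
// $html = fetch_html('http://example.com/');
// echo $html === null ? 'fetch failed' : body_without_comments($html);
```

Keeping the fetch separate from the parsing means the DOM step can be tried on a literal string without any network access.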

1 Comment

Correct, there's no such option in curl. It just gets the data as the server sends it.
0

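I would pipe it to sed for a regex, something like

curl -s http://yoururl.com/test.html | sed -z -E 's/<!--([^-]|-[^-])*-->//g; s#.*(<body.*</body>).*#\1#'

This assumes GNU sed: -z treats the whole input as a single record so a pattern can span lines, and -E enables extended regular expressions. The regexes may not be exact (a comment containing "--" in its text will not match, for example), but you get the idea...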


0

I've run into issues modifying a DOMNodeList in a foreach loop, which went away when I iterated backwards through the list. For that reason, I would not recommend a foreach loop as in the accepted answer. Instead, use a for loop like this:

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
for ($els = $xpath->query('//comment()'), $i = $els->length - 1; $i >= 0; $i--) {
    $els->item($i)->parentNode->removeChild($els->item($i));
}
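
An alternative that keeps the foreach syntax is to copy the node list into a plain array before removing anything. This is only a sketch: remove_comments_snapshot is a hypothetical name, and the sample markup is made up. As far as I know, the problem mainly affects live lists such as those returned by getElementsByTagName, but taking a snapshot avoids the question either way.

```php
<?php
// Hypothetical helper: remove all comment nodes and report how many were removed.
function remove_comments_snapshot(DOMDocument $dom): int
{
    $xpath = new DOMXPath($dom);

    // Copy the matches into an array first, so removals cannot disturb the iteration.
    $comments = iterator_to_array($xpath->query('//comment()'), false);

    foreach ($comments as $comment) {
        $comment->parentNode->removeChild($comment);
    }

    return count($comments);
}

// Assumed sample input with three comments at different depths.
$dom = new DOMDocument();
$dom->loadHTML('<html><body><!--one--><p>text<!--two--></p><!--three--></body></html>');
$removed = remove_comments_snapshot($dom);
```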


0

This worked in my case:

preg_replace('/<!--[\s\S]*?-->/', '', $html);

