remove comments from html source code

Question

I know how to get the HTML source code via cURL, but I want to remove the comments from the HTML document (I mean whatever is between <!-- and -->). In addition, I would like to keep just the BODY of the HTML document. Thank you.

You should reparse them manually... I have a JavaScript library of my own for that, but I don't know how you could implement that in PHP.
– metaforce, Commented Jun 10, 2011 at 11:24

Yoshi · Accepted Answer · 2011-06-10 11:35:49Z

34

Try PHP DOM:

$html = '<html><body><!--a comment--><div>some content</div></body></html>'; // put your cURL result here

$dom = new DOMDocument;
$dom->loadHTML($html);

// Select every comment node in the document and detach it from its parent.
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//comment()') as $comment) {
    $comment->parentNode->removeChild($comment);
}

// Serialize only the <body> element.
$body = $xpath->query('//body')->item(0);
$newHtml = $body instanceof DOMNode ? $dom->saveXML($body) : 'something failed';

var_dump($newHtml);

Output:

string(36) "<body><div>some content</div></body>"


2 Comments

Luis Over a year ago

Look, it is working well. I had never heard about DOM. Thank you.

vee Over a year ago

To make multi-line HTML work without &#13; showing up in place of newlines, change saveXML() to saveHTML(). To include the <html> element in the result, change loadHTML($html) to loadHTML($html, LIBXML_HTML_NODEFDTD) and change the $newHtml line to $newHtml = $body instanceof DOMNode ? $dom->saveHTML() : 'something failed';.
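
The changes described in the comment above could be put together roughly as follows. This is only a sketch: the function name strip_comments_keep_document is made up for illustration, and the libxml_use_internal_errors() calls are an added assumption to keep parser warnings on messy markup out of the output.

```php
<?php
// Hypothetical helper (name not from the original answer): removes comments
// and returns the whole document, <html> element included.
function strip_comments_keep_document(string $html): string
{
    $dom = new DOMDocument();

    // Assumption: collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // LIBXML_HTML_NODEFDTD stops libxml from adding a default doctype.
    $dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Detach every comment node from its parent.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//comment()') as $comment) {
        $comment->parentNode->removeChild($comment);
    }

    // saveHTML() with no argument serializes the full document as HTML.
    $out = $dom->saveHTML();
    return $out === false ? '' : $out;
}

$html = "<html><body><!--a comment-->\n<div>some\ncontent</div></body></html>";
echo strip_comments_keep_document($html);
```

To get only the body back, saveHTML($body) can be called with the node returned by the //body query, in the same way the answer passes $body to saveXML().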

Cepheus · Answer · 2018-08-21 11:48:11Z

3

Regex solved this problem for me as follows:

function remove_html_comments($html = '') {
    return preg_replace('/<!--(.|\s)*?-->/', '', $html);
}
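
A possible variant of the same idea, offered as a sketch rather than a drop-in replacement: the s modifier lets . match newlines, so the (.|\s) alternation is not needed, and the null that preg_replace() returns on error is handled explicitly. The function name strip_html_comments is hypothetical. Like any regex approach, it also strips comment-like text inside <script> blocks or attribute values.

```php
<?php
// Hypothetical helper: regex-based comment removal using the "s" modifier.
function strip_html_comments(string $html): string
{
    // With "s" (dot-all), "." also matches newlines, so multi-line
    // comments are covered; "*?" keeps the match non-greedy.
    $result = preg_replace('/<!--.*?-->/s', '', $html);

    // preg_replace() returns null on error (for example when a PCRE limit
    // is hit); fall back to the unmodified input in that case.
    return $result ?? $html;
}

echo strip_html_comments("<p>keep</p><!-- multi\nline --><p>this</p>");
// prints: <p>keep</p><p>this</p>
```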


David · Answer · 2011-06-10 11:30:56Z

1

If there's no option for this in cURL (and I suspect there isn't, but I've been wrong before), then you can at the very least parse the resulting HTML to your heart's content with a PHP DOM parser.

This will likely be your best bet in the long run in terms of configurability and support.


1 Comment

Daniel Stenberg Over a year ago

Correct, there's no such option in curl. It just gets the data as the server sends it.

Tim Hoolihan · Answer · 2011-06-10 13:17:01Z

0

I would pipe it to sed for a regex, something like

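curl http://yoururl.com/test.html | sed -E 's/<!--\s?\w+\s?-->//g' | sed -E 's#.*(<body>.*</body>).*#\1#'

The regexes may not be exact, but you get the idea. Note that sed works line by line, so this only handles comments and a body that each sit on a single line, and \s and \w are GNU sed extensions.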

edited Jun 10, 2011 at 13:17

answered Jun 10, 2011 at 11:32


You Old Fool · Answer · 2020-02-26 18:21:54Z

0

I've run into issues modifying a DOMNodeList in a foreach loop, which went away when I iterated backwards through the list. For that reason, I would not recommend a foreach loop as in the accepted answer. Instead, use a for loop like this:

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
for ($els = $xpath->query('//comment()'), $i = $els->length - 1; $i >= 0; $i--) {
    $els->item($i)->parentNode->removeChild($els->item($i));
}
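
Another way to sidestep the problem described above is to copy the node list into a plain PHP array before removing anything, so the loop never walks a list that is being modified. A minimal sketch, assuming iterator_to_array() is applied to the DOMNodeList; the function name remove_comment_nodes is made up for illustration.

```php
<?php
// Hypothetical helper: removes all comment nodes and returns how many
// were removed.
function remove_comment_nodes(DOMDocument $dom): int
{
    $xpath = new DOMXPath($dom);

    // Copy the nodes into an ordinary array first, so removals cannot
    // disturb the iteration.
    $comments = iterator_to_array($xpath->query('//comment()'));

    foreach ($comments as $comment) {
        $comment->parentNode->removeChild($comment);
    }

    return count($comments);
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><!--one--><p>text</p><!--two--></body></html>');
$removed = remove_comment_nodes($dom);
```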


Mahdi Shad · Answer · 2022-02-19 21:13:56Z

0

This works in my case:

preg_replace('/<!--[\s\S]*?-->/', '', $html);
