I am looking for an efficient way to strip HTML comments from a string representation of HTML:
<div>
<!-- remove this -->
<ul>
<!-- and this -->
<li></li>
<li></li>
</ul>
</div>
I do not want to convert the string to actual nodes; the content is originally a string, and the file size is around 600 MB.
I am curious whether anyone has had this problem before and found an efficient and easily generalized solution.
If you don't have something like <div title="z<!--this is not"><div title="a comment-->zz"> (notice, though, that this is valid HTML), then regexes are acceptable, as comments are not nestable (<!-- <!-- --> is the "same" as <!-- <!-- --> -->). If you do, or if you are afraid you might, among other reasons, then consider a broader tool, like a parser.

<!-- <!-- --> --> is not the same as <!-- <!-- -->. The first parses to {ignored} -->, the second parses as {ignored}, where {ignored} is the part the HTML parser ignores. This is precisely because comments are not nestable.

The final --> would not be a part of the comment, just as you said, so the regexes wouldn't have to mind nesting. (Edit: It does not matter now. I just tested, and <!-- aa <!-- aa --> is not valid HTML, as I predicted. It yields the error "The document is not mappable to XML 1.0 due to two consecutive hyphens in a comment." at the second --, meaning two hyphens inside a comment are only allowed to close it, nothing else.)
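To make the regex suggestion concrete, here is a minimal sketch in Python. The language is my assumption, since the question does not name one, and the names COMMENT_RE, strip_comments, strip_comments_stream and SAMPLE are made up for illustration. The first function works on a string already in memory. The second applies the same rule to input that arrives in chunks, which may suit a file of several hundred megabytes better. Neither one understands attribute values, so both would mangle the title="z<!--..." example above.

```python
import re
from typing import Iterable, Iterator

# Non-greedy: from "<!--" to the first "-->" that follows it.
# re.DOTALL lets "." cross line breaks, so multi-line comments match too.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html: str) -> str:
    """Remove every <!-- ... --> span from a string held in memory."""
    return COMMENT_RE.sub("", html)


def strip_comments_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Chunked variant: yields the text outside comments, piece by piece.

    Difference from strip_comments: a comment that is never closed
    swallows the rest of the input here, while the regex leaves it alone.
    """
    buf = ""
    in_comment = False
    for chunk in chunks:
        buf += chunk
        while True:
            if in_comment:
                end = buf.find("-->")
                if end == -1:
                    # Keep the last 2 chars: they may be the start of "-->".
                    buf = buf[-2:]
                    break
                buf = buf[end + 3:]
                in_comment = False
            else:
                start = buf.find("<!--")
                if start == -1:
                    # Hold back 3 chars: they may be the start of "<!--".
                    if len(buf) > 3:
                        yield buf[:-3]
                        buf = buf[-3:]
                    break
                if start:
                    yield buf[:start]
                buf = buf[start + 4:]
                in_comment = True
    # Flush whatever was held back, unless we ended inside a comment.
    if not in_comment and buf:
        yield buf


# The markup from the question, used as a small test input.
SAMPLE = """<div>
<!-- remove this -->
<ul>
<!-- and this -->
<li></li>
<li></li>
</ul>
</div>"""

if __name__ == "__main__":
    print(strip_comments(SAMPLE))
    # Feed the same text in 5-character chunks to exercise the boundaries.
    pieces = (SAMPLE[i:i + 5] for i in range(0, len(SAMPLE), 5))
    print("".join(strip_comments_stream(pieces)))
```

For a real file, the chunks could come from repeated f.read(n) calls on a file opened in text mode, with each yielded piece written straight to the output file so that the whole document never sits in memory at once. If attribute values containing <!-- are a possibility, a tokenizer that tracks tags and quotes would be the safer route, as the answer says.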