I am looking for an efficient way to strip HTML comments from a string representation of HTML:

<div>
  <!-- remove this -->
  <ul>
    <!-- and this -->
    <li></li>
    <li></li>
  </ul>
</div>

I do not want to convert the string to actual DOM nodes. The content is originally a string, and the file size is around 600 MB.

I'm curious whether anyone has had this problem before and found an efficient, easily generalized solution.

  • Regular expressions acceptable? Commented Jul 22, 2013 at 16:52
  • Absolutely, preferred. Commented Jul 22, 2013 at 16:54
  • If you don't have nasty markup like <div title="z<!--this is not"><div title="a comment-->zz"> (notice, though, this is valid HTML) then regexes are acceptable, as comments are not nestable (<!-- <!-- --> is the "same" as <!-- <!-- --> -->). If you do, or if you are afraid you might, among other reasons, then consider a broader tool, like a parser. Commented Jul 22, 2013 at 16:55
  • @acdcjunior <!-- <!-- --> --> is not the same as <!-- <!-- -->. The first parses to {ignored} -->, the second parses as {ignored}, where {ignored} is the part the HTML parser ignores. This is precisely because comments are not nestable. Commented Jul 22, 2013 at 17:00
  • @dtech You are right. I expressed myself wrongly. I meant the initial part was the same. Meaning the second --> would not be a part of the comment, just as you said, so the regexes wouldn't have to mind nesting. (Edit: It does not matter now, I just tested, <!-- aa <!-- aa --> is not valid HTML as I predicted. It yields the error The document is not mappable to XML 1.0 due to two consecutive hyphens in a comment. in the second --, meaning two hyphens inside a comment are only allowed to close it, nothing else.) Commented Jul 22, 2013 at 17:04
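
The two cases raised in the comments above can be checked directly. This is only a sketch: the constant name COMMENT and the non-greedy pattern are assumptions for illustration, not something posted in the thread at this point.

```javascript
// Assumed non-greedy pattern: "<!--", then as little as possible, then "-->".
const COMMENT = /<!--[\s\S]*?-->/g;

// Comments do not nest: the match ends at the first "-->",
// so the trailing " -->" is left behind as ordinary text.
const nested = "<!-- <!-- --> -->";
const nestedResult = nested.replace(COMMENT, "");
console.log(JSON.stringify(nestedResult)); // prints " -->"

// Comment-like text inside attribute values is matched as well,
// which merges the two div tags into one.
const tricky = '<div title="z<!--this is not"><div title="a comment-->zz">';
const trickyResult = tricky.replace(COMMENT, "");
console.log(trickyResult); // prints <div title="zzz">
```

So nesting is not a concern for a regex, but markup that hides comment delimiters inside attribute values is.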

1 Answer

Assuming the variable s holds your HTML string, a RegExp replace like the following should work just fine.

s = s.replace(/<!--[\s\S]*?-->/g, "");

Variable s should now have comments removed.
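
If the regex turns out to be a bottleneck on input this large, a plain indexOf scan is one alternative to try. The following is a minimal, unbenchmarked sketch; the function name stripHtmlComments is hypothetical. It treats everything from <!-- to the next --> as a comment, so it has the same blind spot for comment-like text inside attribute values, and it leaves an unterminated <!-- untouched.

```javascript
// Hypothetical helper: removes <!-- ... --> spans with a linear scan.
function stripHtmlComments(html) {
  const parts = [];
  let pos = 0;
  while (true) {
    const start = html.indexOf("<!--", pos);
    if (start === -1) break;
    // Look for the terminator only after the opener,
    // so "<!---->" counts as one empty comment.
    const end = html.indexOf("-->", start + 4);
    if (end === -1) break; // unterminated comment: keep the remainder as-is
    parts.push(html.slice(pos, start));
    pos = end + 3;
  }
  // Append whatever follows the last removed comment.
  parts.push(html.slice(pos));
  return parts.join("");
}

const sample =
  "<div>\n  <!-- remove this -->\n  <ul>\n    <!-- and this -->\n    <li></li>\n  </ul>\n</div>";
console.log(stripHtmlComments(sample));
```

Whether a 600 MB string fits in memory at all depends on the JavaScript engine, so the input may need to be processed in pieces, taking care not to split a comment across a piece boundary.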

3 Comments

Will test, although I am sure this is likely the best solution.
Be aware of the gotcha that -- inside HTML comments is "forbidden" according to MDN: developer.mozilla.org/en-US/docs/Web/API/Comment#Specification. It is allowed by the spec, though the spec notes that some tools may not allow it. This trips me up when I occasionally try to comment out script tags.
The ban on "--" is for XML documents, not HTML documents. Read your link again.
