
I need a regular expression that finds single-line and multi-line comments, but not when they appear inside a string (e.g. "my /* string").

For testing (# starts a single-line comment, /* and */ delimit a multi-line comment):

# complete line should be found
lorem ipsum # from this to line end
/*
  all three lines should be found
*/ but not here anymore
var x = "this # should not be found"
var y = "this /* shouldn't */ match either"
var z = "but" & /* this must match */ "_"

SO's syntax highlighting displays this really well; I basically want all the gray text.
I don't care if it's a single regex or two separate ones. ;)

EDIT: One more thing. The opposite would also satisfy me: searching for a string which is not in a comment.
This is my current string matching: "[\s\S]*?(?<!\\)" (admittedly, it will not work with "\\")
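
As a rough sketch of how the "\\" case could be handled: instead of a lookbehind, the pattern below consumes each backslash together with the character after it, so an escaped backslash can no longer be mistaken for an escaped quote. It assumes C-style backslash escapes and uses Python purely for illustration; the names STRING_RE and find_strings are made up here. It does not yet exclude strings that sit inside comments.

```python
import re

# A double-quoted string: any run of characters that are neither a quote
# nor a backslash, or a backslash followed by exactly one character.
STRING_RE = re.compile(r'"(?:[^"\\]|\\.)*"', re.DOTALL)


def find_strings(text):
    """Return every double-quoted string literal found in text."""
    return STRING_RE.findall(text)


sample = r'''
var a = "plain"
var b = "escaped \" quote"
var c = "ends with a backslash \\"
'''
for literal in find_strings(sample):
    print(literal)
```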

EDIT2:
OK, I finally wrote my own comment parser -.-
And if someone else is interested in the source code, grab it from here: https://github.com/relikd/CommentParser

  • What are you using it for? If you have a specific purpose in mind, someone may already have written something that does it for you. Commented Feb 9, 2012 at 0:34
  • The regex needed for that is ... non-trivial. Which program(ming language) are you planning to use? I have a C 'comment stripper' which can handle the C /* ... */ comments; it is not designed to handle # comments (though it does handle C++ // comments OK). And it has an inverse mode - print the comments and not the non-comment material. But it is a non-negligible amount of C code that does that. Commented Feb 9, 2012 at 0:34
  • I'm writing a small syntax highlighter in ObjC but I thought there would be a generic regex rather than searching char by char :/ Commented Feb 9, 2012 at 0:39
  • Excluding character strings is one major source of complexity that a regex does not handle easily. The full semantics of C comments are horrid. The slash and star that start the comment can be separated by an arbitrary number of backslash-newline character pairs, for example; ditto for the star-slash at the end of the comment. Technically, a C++ // comment can have an arbitrary number of backslash-newline pairs in between the two slashes. Any regex therefore has to be in a language where you are not reading 'one line at a time' for the C-style comments (the #...EOL comments are easier). Commented Feb 9, 2012 at 0:44
  • Also, consider the following. You probably want #/* to be a single line comment. And you probably don't want #*/ to close an existing comment. Commented Feb 9, 2012 at 0:50

2 Answers


Here's one possibility (it does have an Achilles heel that I'll get to):

(#[^"\n\r]*(?:"[^"\n\r]*"[^"\n\r]*)*[\r\n]|/\*([^*]|\*(?!/))*?\*/)(?=[^"]*(?:"[^"]*"[^"]*)*$)

In action here

With the GLOBAL and DOTALL flags, but not the MULTILINE flag.

Explanation of the regex:

(
  #[^"\n\r]*                         Hash mark followed by non-" and non-end-of-line
    (?:"[^"\n\r]*"[^"\n\r]*)*        If any quotes in the comment, they must be balanced
    [\r\n]                           Followed by end-of-line ($ except we 
                                      don't have multiline flag)

  |                                  OR
  /\*([^*]|\*(?!/))*?\*/             /* xxx */ sort of comment
  )                                  BOTH FOLLOWED BY
(?=[^"]*(?:"[^"]*"[^"]*)*$)           only a *balanced* number of quotes for the 
                                      *rest of the code :O!*

However, this relies on balanced quotes being used throughout the text (it also doesn't take into account escaped quotes, but it's easy enough to modify the regex to take that into account).

If a user has a comment with a " in it that isn't balanced...boom. You're screwed!
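
To make that concrete, here is a small sketch that runs the pattern above over the sample text from the question, and then over a comment with a single unbalanced quote. Python is used only for illustration (finditer plays the role of the GLOBAL flag, re.DOTALL is passed, re.MULTILINE is not); COMMENT_RE, find_comments and SAMPLE are names invented for this example.

```python
import re

# The pattern from above, split over three lines for readability.
COMMENT_RE = re.compile(
    r'(#[^"\n\r]*(?:"[^"\n\r]*"[^"\n\r]*)*[\r\n]'
    r'|/\*([^*]|\*(?!/))*?\*/)'
    r'(?=[^"]*(?:"[^"]*"[^"]*)*$)',
    re.DOTALL,
)


def find_comments(text):
    """Return the text of every comment the pattern matches."""
    return [m.group(1) for m in COMMENT_RE.finditer(text)]


SAMPLE = '''# complete line should be found
lorem ipsum # from this to line end
/*
  all three lines should be found
*/ but not here anymore
var x = "this # should not be found"
var y = "this /* shouldn't */ match either"
var z = "but" & /* this must match */ "_"
'''

for comment in find_comments(SAMPLE):
    print(repr(comment))

# The weak spot: one unbalanced quote inside a comment defeats the pattern,
# so this comment is not found at all.
print(find_comments('x = 1 # remove all the " marks\n'))
```

Note that the # alternative includes the trailing line break in the match, so you may want to strip it before highlighting.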

Regex is generally not recommended for things like HTML/code parsing, but if you can rely on the fact that quotes have to balance when you define a string, etc., you can sometimes get away with it.

Since you are also parsing comments, which have no set structure (i.e. you are not guaranteed that quotes within comments will be balanced), you won't be able to find a watertight regex solution here.

Anything you think up can be outwitted by an unbalanced quote in a comment somewhere (say the comment was # remove all the " marks), or by multiline strings (where on a given line there may be unbalanced quotes).

Bottom line - you can probably make a regex that will work in most cases, but not for all. To get something watertight you'll have to write some code.
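
For what "some code" might look like, here is a minimal sketch of a character-by-character scanner. It is not the parser the asker ended up writing; it is an illustration in Python that assumes double-quoted strings with backslash escapes, # comments running to the end of the line, and non-nested /* ... */ comments. The function name find_comments is made up for this example.

```python
def find_comments(text):
    """Return (start, end) spans of comments, skipping string literals."""
    spans = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == '"':
            # Skip the string literal, stepping over backslash escapes.
            i += 1
            while i < n and text[i] != '"':
                i += 2 if text[i] == '\\' else 1
            i += 1  # step past the closing quote
        elif ch == '#':
            # Single-line comment: runs up to, not including, the line break.
            end = text.find('\n', i)
            end = n if end == -1 else end
            spans.append((i, end))
            i = end
        elif text.startswith('/*', i):
            # Multi-line comment: runs through the first closing */.
            end = text.find('*/', i + 2)
            end = n if end == -1 else end + 2
            spans.append((i, end))
            i = end
        else:
            i += 1
    return spans


source = 'x = 1 # remove all the " marks\ny = "no # here" & /* yes */ "_"\n'
for start, end in find_comments(source):
    print(repr(source[start:end]))
```

Because the scanner always knows whether it is inside a string or a comment, an unbalanced quote in a comment cannot confuse it, and #/* is treated as a single-line comment.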


1 Comment

+1 because it works better than everything I tried for hours :D. Can you try the other way? Maybe finding a string that's not in a comment is easier (at least a single-line string).

I would use two regular expressions for this:

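  1. /(\/\*[\s\S]*?\*\/)|(#.*?$)/m to find all the comments; the "m" modifier enables multiline mode so that $ matches at the end of every line, and [\s\S] lets a /* ... */ comment span several lines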
  2. /"[^"]*?"/ to find all the strings

If you apply the highlighting to the comments first and only after to the strings, the invalid comments should disappear.
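
A possible variation on this idea, sketched in Python: put both patterns into one alternation, so that at each position whichever construct starts first consumes its whole extent. A quote inside a comment then never starts a string, and a # inside a string never starts a comment. The names TOKEN_RE and highlight_spans are invented for this sketch, and the backslash-escape handling is an assumption about the language being highlighted.

```python
import re

TOKEN_RE = re.compile(r'''
    (?P<string>  "(?:[^"\\\n]|\\.)*" )    # string literal with escapes
  | (?P<comment> \#[^\n]* | /\*.*?\*/ )   # hash to end of line, or a block comment
''', re.VERBOSE | re.DOTALL)


def highlight_spans(text):
    """Return (kind, start, end) for every string and comment, in order."""
    return [(m.lastgroup, m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


code = 'var x = "this # no"\nvar z = "but" & /* "yes" */ "_" # tail " here'
for kind, start, end in highlight_spans(code):
    print(kind, repr(code[start:end]))
```

Since everything is found in a single pass, there is no need to compare the matches of one pattern against the matches of the other afterwards.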

4 Comments

This is exactly what I wanted to avoid ^^. I currently have a list of regexes to load, with strings and comments overriding each other. So if I apply a color for the comments first and then another color for the strings, strings inside comments would also be colored.
You're right, didn't think of that :) Not sure if you can do without a parser though.
Have you considered checking if each match contains any of the others and overriding them in that case?
I think that would be too expensive. Let's say I have a large document and the code gets colorized after each single char the user types. I think if I don't get an answer I will simply display strings colored inside comments (not the best solution, but faster than the mix of both).
