Extract non-comment portion of code with Python Regex

Question

I am trying to extract the "non-comment" portion of C code using Python. So far my code can extract "non_comment" in the examples below; if it can't find one, it simply returns "".

// comment
/// comment
non_comment;
non_comment; /* comment */
non_comment; // comment
/* comment */ non_comment;
/* comment */ non_comment; /* comment */
/* comment */ non_comment; // comment

Here is the source code. I use doctest to unit test the different scenarios.

import re
import doctest

def remove_comment(expr):
  """
  >>> remove_comment('// comment')
  ''
  >>> remove_comment('/// comment')
  ''
  >>> remove_comment('non_comment;')
  'non_comment;'
  >>> remove_comment('non_comment; /* comment */')
  'non_comment;'
  >>> remove_comment('non_comment; // comment')
  'non_comment;'
  >>> remove_comment('/* comment */ non_comment;')
  'non_comment;'
  >>> remove_comment('/* comment */ non_comment; /* comment */')
  'non_comment;'
  >>> remove_comment('/* comment */ non_comment; // comment')
  'non_comment;'
  """
  expr = expr.strip()
  if expr.startswith(('//', '///')):
      return ''
  # throw away /* ... */ comment, and // comment at the end
  pattern = r'(/\*.*\*/\W*)?(\w+;)(//|/\*.*\*/\W*)?'
  r = re.search(pattern, expr)
  return r.group(2).strip() if r else ''    

doctest.testmod()

However, I'm not happy with this code and believe there should be a better way to handle this. Does anyone know a better way to do it? Thanks!

You could remove the comments instead: stackoverflow.com/questions/241327/…
– Jean-François Fabre ♦, Commented Jul 31, 2018 at 21:06
Can you try (?:\/\/\/?|\/\*)\s*\w+\s*(?:$|\*\/)?|(\w+;) here?
– Paolo, Commented Jul 31, 2018 at 21:07

score 1 · Accepted Answer · 2018-08-01 20:30:15Z

To find comments, you also have to find quoted strings, since comment syntax can
be embedded in a string.
The reverse is also true: strings can be embedded in comments.

The regex below captures comments in group 1 and non-comments in group 2.

So, to remove comments -

re.sub(r'(?m)((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)|((?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|\'[^\'\\]*(?:\\[\S\s][^\'\\]*)*\'|(?:\r?\n(?:(?=(?:^[ \t]*)?(?:/\*|//))|[^/"\'\\\r\n]*))+|[^/"\'\\\r\n]+)+|[\S\s][^/"\'\\\r\n]*)', r'\2', sourceTxt)

To get only the non-comments, match everything and save the group 2 items to an array.

This regex preserves formatting and uses assertions.
There is also a stripped-down version further down that does not preserve
formatting and does not use assertions.

Demo PCRE: https://regex101.com/r/UldYK5/1
Demo Python: https://regex101.com/r/avfSfB/1

Readable regex

    # raw:   (?m)((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)|((?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n(?:(?=(?:^[ \t]*)?(?:/\*|//))|[^/"'\\\r\n]*))+|[^/"'\\\r\n]+)+|[\S\s][^/"'\\\r\n]*)
    # delimited:  /(?m)((?:(?:^[ \t]*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/))|(?=\r?\n))))+)|((?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n(?:(?=(?:^[ \t]*)?(?:\/\*|\/\/))|[^\/"'\\\r\n]*))+|[^\/"'\\\r\n]+)+|[\S\s][^\/"'\\\r\n]*)/

    (?m)                             # Multi-line modifier
    (                                # (1 start), Comments
         (?:
              (?: ^ [ \t]* )?                  # <- To preserve formatting
              (?:
                   /\*                              # Start /* .. */ comment
                   [^*]* \*+
                   (?: [^/*] [^*]* \*+ )*
                   /                                # End /* .. */ comment
                   (?:                              # <- To preserve formatting
                        [ \t]* \r? \n
                        (?=
                             [ \t]*
                             (?: \r? \n | /\* | // )
                        )
                   )?
                |
                   //                               # Start // comment
                   (?:                              # Possible line-continuation
                        [^\\]
                     |  \\
                        (?: \r? \n )?
                   )*?
                   (?:                              # End // comment
                        \r? \n
                        (?=                              # <- To preserve formatting
                             [ \t]*
                             (?: \r? \n | /\* | // )
                        )
                     |  (?= \r? \n )
                   )
              )
         )+                               # Grab multiple comment blocks if need be
    )                                # (1 end)

 |                                 ## OR

    (                                # (2 start), Non - comments
         # Quotes
         # ======================
         (?:                              # Quote and Non-Comment blocks
              "
              [^"\\]*                          # Double quoted text
              (?: \\ [\S\s] [^"\\]* )*
              "
           |                                 # --------------
              '
              [^'\\]*                          # Single quoted text
              (?: \\ [\S\s] [^'\\]* )*
              '
           |                                 # --------------

              (?:                              # Qualified linebreaks
                   \r? \n
                   (?:
                        (?=                              # If comment ahead just stop
                             (?: ^ [ \t]* )?
                             (?: /\* | // )
                        )
                     |                                 # or,
                        [^/"'\\\r\n]*                    # Chars which don't start a comment, string, escape,
                                                         # or line continuation (escape + newline)
                   )
              )+
           |                                 # --------------
              [^/"'\\\r\n]+                    # Chars which don't start a comment, string, escape,
                                               # or line continuation (escape + newline)

         )+                               # Grab multiple instances

      |                                 # or,
         # ======================
         # Pass through

         [\S\s]                           # Any other char
         [^/"'\\\r\n]*                    # Chars which don't start a comment, string, escape,
                                          # or line continuation (escape + newline)

    )                                # (2 end), Non - comments

If you use a particular engine that doesn't support assertions,
then you'd have to use this.
This won't preserve formatting though.

Usage same as above.

    # (/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)


    (                                # (1 start), Comments 
         /\*                              # Start /* .. */ comment
         [^*]* \*+
         (?: [^/*] [^*]* \*+ )*
         /                                # End /* .. */ comment
      |  
         //                               # Start // comment
         (?: [^\\] | \\ \n? )*?           # Possible line-continuation
         \n                               # End // comment
    )                                # (1 end)
 |  
    (                                # (2 start), Non - comments 
         "
         (?: \\ [\S\s] | [^"\\] )*        # Double quoted text
         "
      |  '
         (?: \\ [\S\s] | [^'\\] )*        # Single quoted text
         ' 
      |  [\S\s]                           # Any other char
         [^/"'\\]*                        # Chars which don't start a comment, string, escape,
                                          # or line continuation (escape + newline)
    )                                # (2 end)
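
As a rough sketch of how the stripped-down pattern might be used from Python: group 1 matches are dropped and group 2 matches are kept. The names COMMENT_OR_CODE, strip_comments and non_comment_chunks are made up for this illustration and are not part of the answer. Note that the // branch requires a terminating newline, so a // comment on the last line is only removed if the input ends with \n.

```python
import re

# The stripped-down pattern from above, split across several raw strings
# only to keep the Python quoting readable.
# Group 1 = comments, group 2 = everything else (strings and plain code).
COMMENT_OR_CODE = re.compile(
    r'(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)'
    r'|("(?:\\[\S\s]|[^"\\])*"'
    r"|'(?:\\[\S\s]|[^'\\])*'"
    r'''|[\S\s][^/"'\\]*)'''
)


def strip_comments(source):
    """Return source with every group 1 (comment) match removed."""
    return COMMENT_OR_CODE.sub(lambda m: m.group(2) or '', source)


def non_comment_chunks(source):
    """Collect the group 2 (non-comment) matches into a list."""
    return [m.group(2) for m in COMMENT_OR_CODE.finditer(source) if m.group(2)]


# The trailing newline matters: it is what ends the // comment.
print(repr(strip_comments('non_comment; // comment\n')))  # 'non_comment; '
print(non_comment_chunks('a; /* c */ b;'))
```

Whitespace around the removed comments is left in place, so callers that want the bare statement would still need to strip the result.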

I'm sorry, but that level of complexity is unnecessary. Sure there could be comment syntax within a string, but how often does that happen? In most cases you don't need a 300 character regex pattern.
@emsimpson92 - Just for you. For the sake of performance gains, I just made it more complex. I also added below the preserve formatting one, the standard comment stripper regex for the last 20 years (from a Perl god) from which the extremely complex one is derived. Enjoy these, I may not be around much longer...

emsimpson92 · Accepted Answer · 2018-07-31 21:18:24Z

0

Instead of extracting all the non-comments, try removing the comments by replacing them with "".

The pattern is \/\/.*|\/\*[^*]*\*\/. It matches anything enclosed in /*...*/ or anything from // to the end of the line.
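
Plugged into the asker's remove_comment, this might look like the sketch below. The .strip() call is an addition made here so that the result matches the doctests in the question, and COMMENT is just a name chosen for this example. The pattern is written without the \/ escapes, which Python does not need.

```python
import re

# The answer's pattern: a // comment to end of line, or a /* ... */ block
# whose body contains no asterisk.
COMMENT = re.compile(r'//.*|/\*[^*]*\*/')


def remove_comment(expr):
    """Delete // and /* ... */ comments, then trim surrounding whitespace."""
    return COMMENT.sub('', expr).strip()


print(repr(remove_comment('/* comment */ non_comment; // comment')))
```

Because [^*]* cannot cross an asterisk, a block comment with a * inside its body (for example /* a * b */) is left in place by this pattern.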


6 Comments

Scoodood Over a year ago

Wow, this is excellent! I just came up with something similar, but did it in 2 lines instead of 1. I use re.sub(r'//.*', '', expr), followed by re.sub(r'/\*[\w ]*\*/', '', expr). But yours is even better. By the way, what does [^*] mean?

emsimpson92 Over a year ago

That means it will match any character that isn't *. Replace [^*] with . and you'll see why I did that.

Scoodood Over a year ago

Good to know, but it will take me a while to digest. I will keep [\w ]* for now because it's more explicit and natural to my brain. Thanks!

emsimpson92 Over a year ago

A simpler approach would be \/\/.*|\/\*.*?\*\/
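
To see what this exchange is about, here is a small sketch comparing the greedy, negated-class and lazy forms of the block-comment pattern on one line. The sample line and the variable names are made up for the illustration.

```python
import re

line = '/* one */ non_comment; /* two */'

# Greedy .* runs on to the last */ and swallows the code in between.
greedy = re.findall(r'/\*.*\*/', line)

# [^*]* cannot cross an asterisk, so each comment is matched separately.
negated = re.findall(r'/\*[^*]*\*/', line)

# Lazy .*? stops at the first */ it can reach.
lazy = re.findall(r'/\*.*?\*/', line)

print(greedy)
print(negated)
print(lazy)
```

On this input the negated-class and lazy forms find the same two comments, while the greedy form returns the whole line as a single match.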

kantal Over a year ago

It fails on a line such as ' printf("%s","/*non comment*/"); #non comment '
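
A quick check of that objection, as a sketch using the simpler pattern from the earlier comment: the text inside the string literal is treated as a comment and removed. The names simple and stripped are only for this example.

```python
import re

# The simple comment pattern, with no awareness of string literals.
simple = re.compile(r'//.*|/\*.*?\*/')

line = ' printf("%s","/*non comment*/"); #non comment '
stripped = simple.sub('', line)

# The second argument of printf has lost its contents.
print(repr(stripped))
```

Handling this case needs a pattern that also matches quoted strings, so that comment markers inside them are skipped rather than removed.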
