
I am trying to work out a good regular expression to match Python comments located within a long string. So far I have

regex:

#(.?|\n)*

string:

'### this is a comment\na = \'a string\'.toupper()\nprint a\n\na_var_name = " ${an.injection} "\nanother_var = " ${bn.injection} "\ndtabse_conn = " ${cn.injection} "\n\ndef do_something()\n    # this call outputs an xml stream of the current parameter dictionary.\n    paramtertools.print_header(params)\n\nfor i in xrange(256):    # wow another comment\n    print i**2\n\n'

I feel like there is a much better way to get all of the individual comments from the string, but I am not an expert in regular expressions. Does anyone have a better solution?

  • I don't think that this is doable with a Python regex, since the # may be inside something like a="#foo". Even more complex situations with more opening and closing \" or \' characters are possible, so it would not surprise me if someone could show via the pumping lemma that it's not doable with a regex alone. @alecxe has a better solution. Commented Jul 18, 2014 at 16:39
  • Why not just split the string on newlines using str = str.split('\n') and then iterate over the result? Commented Jul 18, 2014 at 17:37
  • @RevanProdigalKnight because I have to compare the previous and following n characters to the regex findings so that I can mutate 2 strings into a 3rd string. I have tried that method, and splitting lines increases the complexity. This is all because I am doing code transforms on a file, and after the transform I have to add the comments back in at the appropriate places. Commented Jul 18, 2014 at 17:44
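A quick illustration of the failure mode raised in the first comment: a naive "grab everything after #" pattern cannot tell a # inside a string literal from a real comment (minimal sketch, not the OP's code):

```python
import re

# The match starts at the # inside the string literal, not at the
# real comment, which is exactly the problem described above.
code = 'a = "#foo"  # real comment'
matches = re.findall(r'#.*', code)
print(matches)  # → ['#foo"  # real comment']
```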

3 Answers


Get the comments from matched group at index 1.

(#+[^\\\n]*)


Sample code:

import re
p = re.compile(ur'(#+[^\\\n]*)')
test_str = u"..."

re.findall(p, test_str)

Matches:

1.  ### this is a comment
2.  # this call outputs an xml stream of the current parameter dictionary.
3.  # wow another comment
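A runnable version of the sample above, using a shortened form of the question's string (the u/ur prefixes are Python 2 only; in Python 3 they can simply be dropped):

```python
import re

p = re.compile(r'(#+[^\\\n]*)')
test_str = ('### this is a comment\n'
            "a = 'a string'.toupper()\n"
            'for i in xrange(256):    # wow another comment\n'
            '    print i**2\n')

# Each match runs from the first # to the end of the line.
print(re.findall(p, test_str))
# → ['### this is a comment', '# wow another comment']
```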

2 Comments

I haven't tested enough cases yet but so far this is looking like exactly what I needed.
Go ahead and test all the cases.

Since the string contains Python code, I'd use the tokenize module to parse it and extract the comments:

import tokenize
import StringIO

text = '### this is a comment\na = \'a string\'.toupper()\nprint a\n\na_var_name = " ${an.injection} "\nanother_var = " ${bn.injection} "\ndtabse_conn = " ${cn.injection} "\n\ndef do_something():\n    # this call outputs an xml stream of the current parameter dictionary.\n    paramtertools.print_header(params)\n\nfor i in xrange(256):    # wow another comment\n    print i**2\n\n'

tokens = tokenize.generate_tokens(StringIO.StringIO(text).readline)
for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokens:
    if toktype == tokenize.COMMENT:
        print ttext

Prints:

### this is a comment
# this call outputs an xml stream of the current parameter dictionary.
# wow another comment

Note that the code in the string has a syntax error: missing : after the do_something() function definition.

Also, note that ast module would not help here, since it doesn't preserve comments.
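The snippet above is Python 2 (StringIO module, print statement). In Python 3 the same approach looks like this: StringIO lives in io, and the tokens are named tuples, so no unpacking is needed (a minimal sketch):

```python
import io
import tokenize

text = '### this is a comment\nx = 1  # inline comment\n'

# generate_tokens yields TokenInfo named tuples; filter on COMMENT.
comments = [tok.string
            for tok in tokenize.generate_tokens(io.StringIO(text).readline)
            if tok.type == tokenize.COMMENT]
print(comments)
# → ['### this is a comment', '# inline comment']
```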

5 Comments

I have tried that; the problem is that tokenize.untokenize is not a dependable function, and I am using this to transform code.
If the ast module did preserve comments, I would have 3 weeks of my life back.
@baallezx could you please elaborate a bit more about why you cannot use tokenize here? Thank you.
@alecxe there are many issues with tokenize.untokenize, like breaking when it comes across a line continuation character, plus a few others that I can't think of off the top of my head. I will try it again, using token.start and token.end as references to placements within the string, and get back to you. Maybe that will work.
@baallezx thank you, yeah, give it a try. Probably it would be better to stick with this approach and solve the issues you have with it either by yourself or with the help of the SO community by creating separate questions. I'm still pretty sure this is the most robust approach (especially compared to a regex solution).
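The token.start/token.end idea floated in the comments above could be sketched like this: record each comment together with its position span, so it can be re-inserted after a transform (a hedged sketch, not the OP's actual transform code):

```python
import io
import tokenize

text = 'x = 1  # keep me\ny = 2\n'

# Record each comment with its (row, col) start/end so it can be
# re-inserted later; rows are 1-based, columns 0-based.
placements = [(tok.start, tok.end, tok.string)
              for tok in tokenize.generate_tokens(io.StringIO(text).readline)
              if tok.type == tokenize.COMMENT]
print(placements)
# → [((1, 7), (1, 16), '# keep me')]
```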

Regex will work fine if you do two things:

  1. Remove all string literals (since they can contain # characters).

  2. Capture everything that starts with a # character and proceeds to the end of the line.

Below is a demonstration:

>>> from re import findall, sub
>>> string = '### this is a comment\na = \'a string\'.toupper()\nprint a\n\na_var_name = " ${an.injection} "\nanother_var = " ${bn.injection} "\ndtabse_conn = " ${cn.injection} "\n\ndef do_something()\n    # this call outputs an xml stream of the current parameter dictionary.\n    paramtertools.print_header(params)\n\nfor i in xrange(256):    # wow another comment\n    print i**2\n\n'
>>> findall("#.*", sub('(?s)\'.*?\'|".*?"', '', string))
['### this is a comment', '# this call outputs an xml stream of the current parameter dictionary.', '# wow another comment']
>>>

re.sub removes anything of the form "..." or '...'. This saves you from having to worry about comments that are inside string literals.

(?s) sets the dot-all flag, which allows . to match newline characters.

Lastly, re.findall gets everything that starts with a # character and proceeds to the end of the line.


For a more complete test, place this sample code in a file named test.py:

# Comment 1  
for i in range(10): # Comment 2
    print('#foo')
    print("abc#bar")
    print("""
#hello
abcde#foo
""")  # Comment 3
    print('''#foo
    #foo''')  # Comment 4

The solution given above still works:

>>> from re import findall, sub
>>> string = open('test.py').read()
>>> findall("#.*", sub('(?s)\'.*?\'|".*?"', '', string))
['# Comment 1', '# Comment 2', '# Comment 3', '# Comment 4']
>>>
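If it's useful, the two steps can be wrapped in a small helper (extract_comments is just an illustrative name):

```python
from re import findall, sub

def extract_comments(source):
    """Strip string literals first, then grab everything from # to end of line."""
    # (?s) lets . cross newlines, so triple-quoted strings are removed too.
    no_strings = sub(r'(?s)\'.*?\'|".*?"', '', source)
    return findall('#.*', no_strings)

print(extract_comments('s = "#not a comment"  # a real comment\n'))
# → ['# a real comment']
```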

