Regex will work fine if you do two things:
Remove all string literals (since they can contain # characters).
Capture everything that starts with a # character and proceeds to the end of the line.
Below is a demonstration:
>>> from re import findall, sub
>>> string = '### this is a comment\na = \'a string\'.toupper()\nprint a\n\na_var_name = " ${an.injection} "\nanother_var = " ${bn.injection} "\ndtabse_conn = " ${cn.injection} "\n\ndef do_something()\n # this call outputs an xml stream of the current parameter dictionary.\n paramtertools.print_header(params)\n\nfor i in xrange(256): # wow another comment\n print i**2\n\n'
>>> findall("#.*", sub('(?s)\'.*?\'|".*?"', '', string))
['### this is a comment', '# this call outputs an xml stream of the current parameter dictionary.', '# wow another comment']
>>>
re.sub removes anything of the form "..." or '...'. This saves you from having to worry about # characters inside string literals being mistaken for comments.
(?s) sets the dot-all flag, which allows . to match newline characters.
Lastly, re.findall gets everything that starts with a # character and proceeds to the end of the line.
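The two steps described above can be wrapped in a small helper. This is only a sketch: the name find_comments and the sample input are made up for illustration, and the patterns are the same ones used in the demonstration.

```python
import re


def find_comments(source):
    """Return every '#...' run that remains once string literals are removed."""
    # Step 1: drop anything of the form '...' or "..." ((?s) lets . span newlines).
    without_strings = re.sub('(?s)\'.*?\'|".*?"', '', source)
    # Step 2: capture from each remaining # to the end of its line.
    return re.findall('#.*', without_strings)


# Hypothetical input: the first # sits inside a string literal and is ignored.
code = "x = 'not # a comment'  # real comment\ny = 2  # another\n"
print(find_comments(code))  # ['# real comment', '# another']
```

Keep in mind that the pattern pairs quotes naively, so an apostrophe inside a comment (as in # don't) or an escaped quote inside a string can throw the matching off.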
For a more complete test, place this sample code in a file named test.py:
# Comment 1
for i in range(10): # Comment 2
    print('#foo')
    print("abc#bar")
    print("""
#hello
abcde#foo
""") # Comment 3
    print('''#foo
#foo''') # Comment 4
The solution given above still works:
>>> from re import findall, sub
>>> string = open('test.py').read()
>>> findall("#.*", sub('(?s)\'.*?\'|".*?"', '', string))
['# Comment 1', '# Comment 2', '# Comment 3', '# Comment 4']
>>>
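If you also need to know where each comment sits (for example, to put it back after transforming the code), one possible variation is to blank out the string literals instead of deleting them, so that offsets and line numbers stay the same as in the original text. The sketch below assumes that is acceptable; the names blank_out and find_comments_with_lines and the sample input are hypothetical.

```python
import re

STRING_LITERAL = re.compile('(?s)\'.*?\'|".*?"')  # same pattern as above
COMMENT = re.compile('#.*')


def blank_out(match):
    # Replace every character except newlines with a space, so the masked
    # text has the same length and the same line breaks as the original.
    return re.sub(r'[^\n]', ' ', match.group())


def find_comments_with_lines(source):
    """Return (line_number, offset, comment) tuples; line numbers start at 1."""
    masked = STRING_LITERAL.sub(blank_out, source)
    results = []
    for m in COMMENT.finditer(masked):
        line_no = masked.count('\n', 0, m.start()) + 1
        results.append((line_no, m.start(), m.group()))
    return results


# Hypothetical input with a multi-line string that contains # characters.
sample = (
    "# Comment 1\n"
    "x = '''#not\n"
    "#a comment''' # Comment 2\n"
    "print('#foo') # Comment 3\n"
)
for line_no, offset, comment in find_comments_with_lines(sample):
    print(line_no, comment)
```

Because the masked text is the same length as the input, each offset can be used to index directly into the original string.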
str = str.split('\n') and then iterate over the result?
ncharacters to the regex findings so that I can mutate 2 strings into a 3rd string. I have tried that method, and when you split lines you increase the complexity. This is all because I am doing code transforms on a file, and after the transform I have to add the comments back in at the appropriate place.