
I am trying to work out a good regular expression to match Python comments located within a long string. So far I have

regex:

#(.?|\n)*

string:

'### this is a comment\na = \'a string\'.toupper()\nprint a\n\na_var_name = " ${an.injection} "\nanother_var = " ${bn.injection} "\ndtabse_conn = " ${cn.injection} "\n\ndef do_something()\n    # this call outputs an xml stream of the current parameter dictionary.\n    paramtertools.print_header(params)\n\nfor i in xrange(256):    # wow another comment\n    print i**2\n\n'

I feel like there is a much better way to get all of the individual comments from the string, but I am not an expert in regular expressions. Does anyone have a better solution?

  • I don't think that this is doable with a Python regex, since the # may be inside something like a="#foo". Even more complex situations with more opening and closing \" or \' characters are possible, so it would not surprise me if someone could show via the pumping lemma that it's not doable with a regex alone. @alecxe has a better solution. Commented Jul 18, 2014 at 16:39
  • Why not just split the string on newlines using str = str.split('\n') and then iterate over the result? Commented Jul 18, 2014 at 17:37
  • @RevanProdigalKnight because I have to compare the previous and following n characters to the regex findings so that I can mutate 2 strings into a 3rd string. I have tried that method, and splitting lines increases the complexity. This is all because I am doing code transforms on a file, and after the transform I have to add the comments back in at the appropriate places. Commented Jul 18, 2014 at 17:44
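A quick illustration of the failure mode raised in the first comment: a naive "grab everything after #" pattern cannot tell a # inside a string literal from a real comment (minimal sketch, not the OP's code):

```python
import re

# The match starts at the # inside the string literal, not at the
# real comment, which is exactly the problem described above.
code = 'a = "#foo"  # real comment'
matches = re.findall(r'#.*', code)
print(matches)  # → ['#foo"  # real comment']
```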

3 Answers


Get the comments from matched group at index 1.

(#+[^\\\n]*)


Sample code:

import re
p = re.compile(ur'(#+[^\\\n]*)')
test_str = u"..."

re.findall(p, test_str)

Matches:

1.  ### this is a comment
2.  # this call outputs an xml stream of the current parameter dictionary.
3.  # wow another comment
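A runnable version of the sample above, using a shortened form of the question's string (the u/ur prefixes are Python 2 only; in Python 3 they can simply be dropped):

```python
import re

p = re.compile(r'(#+[^\\\n]*)')
test_str = ('### this is a comment\n'
            "a = 'a string'.toupper()\n"
            'for i in xrange(256):    # wow another comment\n'
            '    print i**2\n')

# Each match runs from the first # to the end of the line.
print(re.findall(p, test_str))
# → ['### this is a comment', '# wow another comment']
```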

2 Comments

I haven't tested enough cases yet but so far this is looking like exactly what I needed.
Go ahead and test all the cases.

Since the string contains Python code, I'd use the tokenize module to parse it and extract the comments:

import tokenize
import StringIO

text = '### this is a comment\na = \'a string\'.toupper()\nprint a\n\na_var_name = " ${an.injection} "\nanother_var = " ${bn.injection} "\ndtabse_conn = " ${cn.injection} "\n\ndef do_something():\n    # this call outputs an xml stream of the current parameter dictionary.\n    paramtertools.print_header(params)\n\nfor i in xrange(256):    # wow another comment\n    print i**2\n\n'

tokens = tokenize.generate_tokens(StringIO.StringIO(text).readline)
for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokens:
    if toktype == tokenize.COMMENT:
        print ttext

Prints:

### this is a comment
# this call outputs an xml stream of the current parameter dictionary.
# wow another comment

Note that the code in the string has a syntax error: missing : after the do_something() function definition.

Also, note that ast module would not help here, since it doesn't preserve comments.
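The snippet above is Python 2 (StringIO module, print statement). In Python 3 the same approach looks like this: StringIO lives in io, and the tokens are named tuples, so no unpacking is needed (a minimal sketch):

```python
import io
import tokenize

text = '### this is a comment\nx = 1  # inline comment\n'

# generate_tokens yields TokenInfo named tuples; filter on COMMENT.
comments = [tok.string
            for tok in tokenize.generate_tokens(io.StringIO(text).readline)
            if tok.type == tokenize.COMMENT]
print(comments)
# → ['### this is a comment', '# inline comment']
```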

5 Comments

I have tried that; the problem is that tokenize.untokenize is not a dependable function, and I am using this to transform code.
If the ast module did preserve comments, I would have 3 weeks of my life back.
@baallezx could you please elaborate a bit more about why you cannot use tokenize here? Thank you.
@alecxe there are many issues with tokenize.untokenize, like breaking when it comes across a line continuation character, plus a few others that I can't think of off the top of my head. I will try it again, using token.start and token.end as references to placements within the string, and get back to you. Maybe that will work.
@baallezx thank you, yeah, give it a try. Probably it would be better to stick with this approach and solve the issues you have with it either by yourself or with the help of the SO community by creating separate questions. I'm still pretty sure this is the most robust approach (especially compared to a regex solution).
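The token.start/token.end idea floated in the comments above could be sketched like this: record each comment together with its position span, so it can be re-inserted after a transform (a hedged sketch, not the OP's actual transform code):

```python
import io
import tokenize

text = 'x = 1  # keep me\ny = 2\n'

# Record each comment with its (row, col) start/end so it can be
# re-inserted later; rows are 1-based, columns 0-based.
placements = [(tok.start, tok.end, tok.string)
              for tok in tokenize.generate_tokens(io.StringIO(text).readline)
              if tok.type == tokenize.COMMENT]
print(placements)
# → [((1, 7), (1, 16), '# keep me')]
```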

Regex will work fine if you do two things:

  1. Remove all string literals (since they can contain # characters).

  2. Capture everything that starts with a # character and proceeds to the end of the line.

Below is a demonstration:

>>> from re import findall, sub
>>> string = '### this is a comment\na = \'a string\'.toupper()\nprint a\n\na_var_name = " ${an.injection} "\nanother_var = " ${bn.injection} "\ndtabse_conn = " ${cn.injection} "\n\ndef do_something()\n    # this call outputs an xml stream of the current parameter dictionary.\n    paramtertools.print_header(params)\n\nfor i in xrange(256):    # wow another comment\n    print i**2\n\n'
>>> findall("#.*", sub('(?s)\'.*?\'|".*?"', '', string))
['### this is a comment', '# this call outputs an xml stream of the current parameter dictionary.', '# wow another comment']
>>>

re.sub removes anything of the form "..." or '...'. This saves you from having to worry about comments that are inside string literals.

(?s) sets the dot-all flag, which allows . to match newline characters.

Lastly, re.findall gets everything that starts with a # character and proceeds to the end of the line.


For a more complete test, place this sample code in a file named test.py:

# Comment 1  
for i in range(10): # Comment 2
    print('#foo')
    print("abc#bar")
    print("""
#hello
abcde#foo
""")  # Comment 3
    print('''#foo
    #foo''')  # Comment 4

The solution given above still works:

>>> from re import findall, sub
>>> string = open('test.py').read()
>>> findall("#.*", sub('(?s)\'.*?\'|".*?"', '', string))
['# Comment 1', '# Comment 2', '# Comment 3', '# Comment 4']
>>>
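If it's useful, the two steps can be wrapped in a small helper (extract_comments is just an illustrative name):

```python
from re import findall, sub

def extract_comments(source):
    """Strip string literals first, then grab everything from # to end of line."""
    # (?s) lets . cross newlines, so triple-quoted strings are removed too.
    no_strings = sub(r'(?s)\'.*?\'|".*?"', '', source)
    return findall('#.*', no_strings)

print(extract_comments('s = "#not a comment"  # a real comment\n'))
# → ['# a real comment']
```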

