
How can I define a regex to find multiline comments in Python that contain the word "xyz"? Example of a string that should match:

"""
blah blah
blah
xyz
blah blah
"""

I tried this regex:

"""((.|\n)(?!"""))*?xyz(.|\n)*?"""

(grep -i -Pz '"""((.|\n)(?!"""))*?xyz(.|\n)*?"""')

but it was not good enough. For example, for this input

 """
    blah blah blah
    blah
"""

   # xyz
               
 def foo(self):
"""
blah
"""

it matched this string:

"""

   # xyz
               
 def foo(self):
"""

The expected behavior in this case is to not match anything, since "xyz" is not inside a comment block.

I wanted it to only find "xyz" within opening quotes and closing quotes, but the string it matches is not inside a quotes block. It matches a string that starts with a quote, has "xyz" in it and ends with a quote, but the matched string is NOT inside a python comment block.

Any idea how to get the required behavior from this regex?

  • Parsing programming languages with regexes is mostly a bad idea. Have you considered the ast module? Commented Nov 16, 2022 at 11:47
  • @john1994 – Are you really demanding to get the required behavior from this regex? How about a quite different approach? And what do you want as output: the whole multiline string, or just the line with "xyz"? Commented Nov 16, 2022 at 14:26
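
For reference, the ast suggestion from the comment could look roughly like the sketch below. It assumes the input is valid Python source (the snippet in the question is not, because of its indentation) and treats "multiline comment" as a bare string statement, which includes docstrings. The names SOURCE and find_string_blocks are made up for illustration.

```python
import ast

# Hypothetical, syntactically valid sample source.
SOURCE = '''
def foo():
    """
    blah
    xyz
    """
    # xyz in a real comment is not a string, so it is ignored
    return 1

def bar():
    """
    blah
    """
    return 2
'''

def find_string_blocks(source, word):
    """Return (lineno, text) for bare string statements containing `word`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # A "comment block" written as a string literal is an expression
        # statement whose value is a str constant.
        if (isinstance(node, ast.Expr)
                and isinstance(node.value, ast.Constant)
                and isinstance(node.value.value, str)
                and word.lower() in node.value.value.lower()):
            hits.append((node.lineno, node.value.value))
    return hits

hits = find_string_blocks(SOURCE, "xyz")
for lineno, text in hits:
    print(lineno, repr(text))
```

This sidesteps the quote-balancing problem because the parser decides what is a string, but it only works on files that parse.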

1 Answer


The main challenge is keeping track of the """ ... """ balance, i.e. whether the current position is inside or outside a comment.
Here is an idea using PCRE (e.g. PyPI regex with Python) or grep -Pz (as in your example).

(?ims)^"""(?:(?:[^"]|"(?!""))*?(xyz))?.*?^"""(?(1)|(*SKIP)(*F))

See this demo at regex101 (used with i ignorecase, m multiline and s dotall flags)

This works because the search string is matched optionally, which prevents the engine from backtracking into another match and losing the overall balance. The simplest pattern for keeping the balance would be """.*?""". But as soon as you require some substring inside, the regex engine will try every starting position in order to succeed, including one that begins at a closing """.

To get around this, the search string is matched optionally, so every block is consumed whether or not it contains the word, and the balance is kept. Simplified example: """([^"]*?xyz)?.*?""" versus the unwanted """([^"]*?xyz).*?""".

Now, to still let the matches without the search string fail, I used a conditional afterwards together with the PCRE verbs (*SKIP)(*F). If the first group did not participate (no search string inside), the match just gets skipped.
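
To illustrate the idea without PCRE verbs, here is a minimal sketch for Python's standard-library re module, which does not support (*SKIP)(*F). It drops the trailing conditional and instead filters in Python on whether group 1 took part in the match. The sample text and the names BALANCED, NAIVE and blocks_containing_word are made up for this example.

```python
import re

# The pattern from above without the trailing conditional and (*SKIP)(*F).
# Group 1 is set only when the block contains the search string.
BALANCED = re.compile(r'(?ims)^"""(?:(?:[^"]|"(?!""))*?(xyz))?.*?^"""')

# The pattern from the question, for comparison.
NAIVE = re.compile(r'"""((.|\n)(?!"""))*?xyz(.|\n)*?"""', re.IGNORECASE)

# Hypothetical sample: two blocks without "xyz", a stray "# xyz" between
# them, and one block that really contains "xyz".
SAMPLE = '''"""
blah blah blah
blah
"""

# xyz

def foo(self):
"""
blah
"""

"""
blah
xyz
blah blah
"""
'''

def blocks_containing_word(text):
    """Consume every block in order; keep those where group 1 matched."""
    return [m.group(0) for m in BALANCED.finditer(text)
            if m.group(1) is not None]

naive_hits = [m.group(0) for m in NAIVE.finditer(SAMPLE)]
balanced_hits = blocks_containing_word(SAMPLE)
print(len(naive_hits), "naive matches")
print(balanced_hits)
```

On this sample the question's pattern also returns the span that runs from the first block's closing quotes to the second block's opening quotes, while the balanced pattern keeps only the last block. The Python-side filter plays the role of the conditional here; it is a substitute for the verbs, not the same mechanism.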


For usage with grep, here is a demo at tio.run, or alternatively: pcregrep -M '(?is)pattern'
As mentioned above, in Python this pattern requires PyPI regex; see a Python demo at tio.run.


3 Comments

Wow, that looks promising, I gotta say! But for some reason it does not work when I run "grep -P '(?ms)^"""(?:(?:[^"]|"(?!""))*?(xyz))?.*?^"""(?(1)|(*SKIP)(*F))' test " in the terminal, where the content of test is exactly the text you put in the regex testing link. Am I doing something wrong here? Maybe I have to use some other grep flags?
Yeah, I understand that. It's just weird that the same regex that catches exactly what I need in the regex tester link doesn't do so from the terminal.
@john1994 I tried this at tio.run, which seemed to work well. Another option is to use pcregrep (I edited my answer, see the last line).
