I'm working with a corpus of Python source code, and I would like every string literal to be replaced with STRING. Python strings are awkward to match because they allow so many delimiters. Here is what I've tried and the issues I've run into.

  • r'"(\\"|[^"])*"' and r"'(\\'|[^'])*'"

    This doesn't work if a string contains the opposite delimiter.

  • r'(\'|"|\'\'\'|""")(?:\\\1|(?!\1))*\1'

    This was my attempt at a catch-all, but the lookahead doesn't work. I basically wanted r'(\'|"|\'\'\'|""")(?:\\\1|[^\1])*\1', if that were possible.

  • Multiline strings mess stuff up. You can't use [^"""] because """ is not one character.

  • Strings that contain the other delimiters like "'".
  • Strings that escape the delimiter like '\''.
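
One possible way to get the effect of the impossible [^\1] is a "tempered" dot: a negative lookahead for the delimiter followed by a character that actually consumes input, `(?!\1).`. The lookahead alone is zero-width, which is why the attempt above never advances. The following is a minimal sketch under that assumption; `STRING_RE` and `replace_strings` are hypothetical names, and the pattern ignores string prefixes and comments.

```python
import re

# Hypothetical names for illustration; not part of the original attempts.
# Triple-quote delimiters are listed first so they are tried before the
# one-character delimiters.
STRING_RE = re.compile(
    r'''
    ('{3}|"{3}|'|")     # group 1: the opening delimiter
    (?:
        \\.             # a backslash plus whatever character it escapes
      |
        (?!\1).         # or one character that does not begin the closing delimiter
    )*
    \1                  # the same delimiter closes the string
    ''',
    re.VERBOSE | re.DOTALL,  # DOTALL lets triple-quoted strings span lines
)


def replace_strings(source, placeholder="STRING"):
    """Replace every match of STRING_RE in source with the placeholder."""
    return STRING_RE.sub(placeholder, source)


print(replace_strings('print("\'blah\' \\"blah\\"")'))  # prints: print(STRING)
```

Because DOTALL applies to every delimiter, a stray quote (for example an apostrophe in a comment) can start a match that runs across lines, so this is best treated as an approximation.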

These are the kinds of strings that need to be matched. Each item below is one complete string, delimiters included.

  • '/$\'"`'
  • '\\'
  • '^__[\'\\"]([^\'\\"]*)[\'\\"]'
  • "Couldn't do that"

These are all valid strings, but you can probably see where it might be hard to match them. Essentially, I want this:

def hello_world():
    print("'blah' \"blah\"")

To become:

def hello_world():
    print( STRING )

For simplicity's sake, let's say the entire Python file is inside of a string. Right now I am reading the file line by line, but I could treat it as one string if necessary. It really doesn't matter how the file is read; if your solution requires a specific method, I will use that. I am not sure this problem can be solved entirely with regex, so a solution that involves other code would be much appreciated as well.

  • Why not process this at the AST level, rather than trying to regex the source? Commented Feb 28, 2020 at 20:37
  • I am also considering that approach, but I want to test this approach as well. Commented Feb 28, 2020 at 20:38
  • Why not join the four regexes for """, ''', " and ' with | between them? Commented Feb 28, 2020 at 20:40
  • I've tried that, but I am having trouble using a lookahead. Commented Feb 28, 2020 at 20:41
  • @Mike Can you show an example of a problematic f-string? Commented Feb 28, 2020 at 21:11
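
A lexer-level variant of the AST suggestion in the comments can be sketched with the standard library's `tokenize` module, which already knows every delimiter, prefix, and escape rule. This is only an illustration, not something proposed in the thread: `replace_string_tokens` is a hypothetical name, the input is assumed to be tokenizable Python, and on Python 3.12 and later f-strings arrive as separate FSTRING_START/FSTRING_MIDDLE/FSTRING_END tokens, which this sketch does not handle.

```python
import io
import tokenize


def replace_string_tokens(source, placeholder="STRING"):
    """Replace each STRING token in source with placeholder (hypothetical helper)."""
    # Offset of the first character of every line, so that a token's
    # (row, column) position can be mapped to an index into source.
    offsets = [0]
    for line in io.StringIO(source).readlines():
        offsets.append(offsets[-1] + len(line))

    pieces = []
    pos = 0  # index in source just past the last copied or replaced text
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        start = offsets[tok.start[0] - 1] + tok.start[1]
        end = offsets[tok.end[0] - 1] + tok.end[1]
        pieces.append(source[pos:start])  # text between string literals, unchanged
        pieces.append(placeholder)
        pos = end
    pieces.append(source[pos:])
    return "".join(pieces)


code = 'def hello_world():\n    print("\'blah\' \\"blah\\"")\n'
print(replace_string_tokens(code))
```

Since the tokenizer decides what counts as a string, quotes inside comments are left alone and triple-quoted strings that span several lines are replaced as one unit.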

1 Answer

You can try a regex that matches quoted strings but allows escaping:

[rR]?(?:'([^\\']*(?:\\.[^\\']*)*)'|"([^\\"]*(?:\\.[^\\"]*)*)")

While this may capture the majority of strings, I am pretty sure there are still some exceptions. Triple-quoted strings, for instance, are not matched as a single unit.
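
For reference, the pattern above can be applied to source text with `re.sub`. The sketch below writes the same regex in verbose mode, without the two capture groups since only the whole match is replaced; `UNROLLED` and `mask_strings` are hypothetical names chosen for illustration.

```python
import re

# The answer's pattern, spread out with re.VERBOSE. Whitespace outside the
# character classes is ignored by the regex engine.
UNROLLED = re.compile(
    r"""
    [rR]?                                  # optional raw-string prefix
    (?:
        '  [^\\']*  (?: \\. [^\\']* )*  '  # single-quoted literal
      |
        "  [^\\"]*  (?: \\. [^\\"]* )*  "  # double-quoted literal
    )
    """,
    re.VERBOSE,
)


def mask_strings(source, placeholder="STRING"):
    """Replace every match of UNROLLED in source with placeholder."""
    return UNROLLED.sub(placeholder, source)


print(mask_strings('print("\'blah\' \\"blah\\"")'))  # prints: print(STRING)
```

The whole file can be passed in as one string or line by line; either works here because the pattern never matches across a line break unless the break sits inside the quotes unescaped.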

This is based on J. Friedl's "unrolling the loop" technique:

Unrolling the Loop (using double quotes)

"                              # the start delimiter
 ([^\\"]*                      # anything but the end of the string or the escape char
         (?:\\.                #     the escape char preceding an escaped char (any char)
               [^\\"]*         #     anything but the end of the string or the escape char
                      )*)      #     repeat
                             " # the end delimiter