I'm writing a regular expression to parse arguments in a fashion similar to shell arguments, with spaces and quoted strings as the delimiters, as well as backslash escaping. This seems to work on RegexPal:

(?:(["'])(?:\\(?:\\\\)?\1|\\\\|.)*?\1|(?:\\(?:\\\\)?\s|\\\\|\S)+)

Here is a more readable version of that:

(?:(["'])(?:        # Match a double or single quote followed by
     \\(?:\\\\)?\1  #   an odd number of backslashes, then the same quote
    |\\\\           #   or two backslashes
    |.              #   or anything else  
    )*?\1           # any number of times (lazily) followed by the same quote,
|(?:                # OR
     \\(?:\\\\)?\s  #   an odd number of backslashes, then whitespace
    |\\\\           #   or two backslashes
    |\S             #   or any non-whitespace
 )+                 # any number of times.
)

I've tried putting this into Python with re.findall, but the output is nonsense:

>>> import re
>>> re.findall(
... r"(?:([\"'])(?:\\(?:\\\\)?\1|\\\\|.)*?\1|(?:\\(?:\\\\)?\s|\\\\|\S)+)",
... r'the quick brown\ fox jumps "over the" lazy\\ dog')
['', '', '', '', '"', '', '']

RegexPal, on the other hand, shows the correct result:

[the] [quick] [brown\ fox] [jumps] ["over the"] [lazy\\] [dog]

Am I forgetting to format the pattern a certain way for Python? Or does Python interpret regex differently in some way? I have no idea why the only non-empty match would be a double-quote, and I've confirmed that the pattern itself works the way it should.

  • The shlex module might be of interest to you. Commented Jun 24, 2011 at 20:31
  • I started with the shlex module when I was writing this code, but I've found that it's not flexible enough for my purposes. I need to be able to split the arguments apart and preserve any surrounding quotes, backslashes, and so on. Commented Jun 24, 2011 at 20:36

1 Answer

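The only capturing group in the pattern is (["']); everything else is inside non-capturing groups. When a pattern contains a capturing group, re.findall returns the contents of that group instead of the whole match. So you do get seven matches, but each one is reported as group 1, which is empty unless the token was quoted.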
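
To illustrate, here is a minimal sketch that keeps the pattern from the question unchanged and reads the full match text through re.finditer rather than re.findall. The names PATTERN, TEXT, groups_only, and tokens are chosen for this example only.

```python
import re

# Pattern copied from the question; its only capturing group is ([\"']).
PATTERN = r"(?:([\"'])(?:\\(?:\\\\)?\1|\\\\|.)*?\1|(?:\\(?:\\\\)?\s|\\\\|\S)+)"
TEXT = r'the quick brown\ fox jumps "over the" lazy\\ dog'

# findall reports group 1 for every match, so unquoted tokens come back empty.
groups_only = re.findall(PATTERN, TEXT)

# finditer yields match objects; group(0) is always the entire matched text.
tokens = [m.group(0) for m in re.finditer(PATTERN, TEXT)]

print(groups_only)
print(tokens)
```

Reading group(0) leaves the pattern and its \1 backreferences untouched, which may be the least invasive option if the group numbering matters elsewhere.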


3 Comments

I tried making the outermost group capturing, but that breaks the internal backreferences. If you know of a way to fix that (I've tried changing \1 to \2, to no avail), I'm all ears.
@Fraxtil, backreferences are numbered by the order of the opening parentheses of the capturing groups: the first is \1, the second is \2, and so on.
I'm not sure why it didn't work the first time, but I removed the leading ?: and substituted \2 for \1 again, and it works flawlessly this time. Thank you!
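
The fix described in the comments above can be sketched roughly as follows: the leading ?: is dropped so the outer group captures, which makes the quote group number 2 and turns each \1 into \2. This is an illustration under those assumptions, not the asker's exact code, and the variable names are hypothetical. With two capturing groups, re.findall returns a tuple per match, so the sketch keeps only the first element.

```python
import re

# The outer group is now capturing (group 1); the quote group becomes group 2,
# so each backreference to the opening quote is \2 instead of \1.
PATTERN = r"(([\"'])(?:\\(?:\\\\)?\2|\\\\|.)*?\2|(?:\\(?:\\\\)?\s|\\\\|\S)+)"
TEXT = r'the quick brown\ fox jumps "over the" lazy\\ dog'

# With two groups, findall yields a (group 1, group 2) tuple for each match.
pairs = re.findall(PATTERN, TEXT)

# Group 1 holds the whole token, including any surrounding quotes.
tokens = [whole for whole, _quote in pairs]
print(tokens)
```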
