I'm writing a regular expression to parse arguments the way a shell does: arguments are delimited by whitespace, quoted strings are kept together, and a backslash escapes the character after it. This pattern seems to work on RegexPal:
(?:(["'])(?:\\(?:\\\\)?\1|\\\\|.)*?\1|(?:\\(?:\\\\)?\s|\\\\|\S)+)
Here is a more readable version of that:
(?:(["'])(?: # Match a double or single quote followed by
\\(?:\\\\)?\1 # an odd number of backslashes, then the same quote
|\\\\ # or two backslashes
|. # or anything else
)*?\1 # any number of times (lazily) followed by the same quote,
|(?: # OR
\\(?:\\\\)?\s # an odd number of backslashes, then whitespace
|\\\\ # or two backslashes
|\S # or any non-whitespace
)+ # one or more times.
)
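For reference, the readable layout above can be handed to Python more or less as-is, since the `re.VERBOSE` flag ignores unescaped whitespace and `#` comments outside character classes. This is only a sketch: the name `ARG_PATTERN` is made up for illustration, and the comments are reworded to describe what each alternative literally matches.

```python
import re

# Hypothetical name; the pattern is the same one as above, laid out for re.VERBOSE.
ARG_PATTERN = re.compile(r"""
    (?:(["'])(?:          # a double or single quote, then
        \\(?:\\\\)?\1     # one or three backslashes, then the same quote
        |\\\\             # or two backslashes
        |.                # or anything else
    )*?\1                 # repeated lazily, up to the same closing quote
    |(?:                  # OR
        \\(?:\\\\)?\s     # one or three backslashes, then whitespace
        |\\\\             # or two backslashes
        |\S               # or any non-whitespace
    )+                    # one or more times
    )
""", re.VERBOSE)

# group(0) is the full text matched, regardless of capture groups
print([m.group(0) for m in ARG_PATTERN.finditer(r'the quick brown\ fox')])
```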
I've tried using this in Python with re.findall, but the output makes no sense:
>>> re.findall(
... r"(?:([\"'])(?:\\(?:\\\\)?\1|\\\\|.)*?\1|(?:\\(?:\\\\)?\s|\\\\|\S)+)",
... r'the quick brown\ fox jumps "over the" lazy\\ dog')
['', '', '', '', '"', '', '']
RegexPal, on the other hand, shows the correct result:
[the] [quick] [brown\ fox] [jumps] ["over the"] [lazy\\] [dog]
Am I forgetting to format the pattern a certain way for Python? Or does Python interpret regular expressions differently in some way? I have no idea why the only non-empty match would be a double quote, and I've confirmed in RegexPal that the pattern itself matches what it should.
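One thing that may help narrow this down: when a pattern contains capture groups, `re.findall` is documented to return the captured groups rather than the whole match. Here the only capturing group is `(["'])`, which would be consistent with the list above (empty strings for unquoted arguments, a lone `"` for the quoted one). The sketch below compares `re.findall` with `re.finditer` plus `group(0)`. The variable names are just for illustration.

```python
import re

pattern = r"(?:([\"'])(?:\\(?:\\\\)?\1|\\\\|.)*?\1|(?:\\(?:\\\\)?\s|\\\\|\S)+)"
text = r'the quick brown\ fox jumps "over the" lazy\\ dog'

# findall returns the contents of capture group 1 (the opening quote, if any)
groups_only = re.findall(pattern, text)

# finditer yields match objects; group(0) is the full text of each match
full_matches = [m.group(0) for m in re.finditer(pattern, text)]

print(groups_only)
print(full_matches)
```

If `full_matches` lines up with the RegexPal result, the pattern itself is probably fine and the difference lies only in what `re.findall` chooses to return. Making the quote group part of a larger capturing group, or switching to `re.finditer`, would be two ways to get at the whole match.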