I'm currently using Python's regular expression library (re) to split the following string into semicolon-delimited fields.

'key1:"this is a test phrase"; key2:"this is another test phrase"; key3:"ok this is a gotcha\; but you should get it";'

Regex: \s*([^;]+[^\\])\s*;

The pattern above worked fine until I hit a case where one of the phrases contains an escaped semicolon, as in key3 above.

How can I modify this expression to only split on the non-escaped semicolons?

4
  • What happened when you tried the \; in the sample above? Seems like it should fail to match until after that point. Is the [^\\] in your pattern an attempted workaround for this issue, or does that have some significance besides dealing with \;? Commented Dec 8, 2011 at 18:07
  • May the quoted strings contain escaped quotes? i.e. key:" \" "; And may the quoted strings contain non-escaped semicolons? i.e. key:" ; ";? Commented Dec 8, 2011 at 18:17
  • Justin, it was an attempted work around for this issue. The first two groups are correctly parsed, but the odd (last) group ends up just being 'but you should get it";', chopping off the block before the escaped semicolon. Commented Dec 8, 2011 at 18:23
  • Ridgerunner, the semicolons and quotes in the string must be escaped. Commented Dec 8, 2011 at 18:24

2 Answers

The basic version of this is where you want to ignore any ; that's preceded by a backslash, regardless of anything else. That's relatively simple:

\s*((?:[^;]|\\;)*[^;\\]);
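
A quick way to check a candidate for this basic behaviour is to run it over the sample string from the question. The sketch below does that for the pattern \s*((?:[^;]|\\;)*[^;\\]); the names BASIC, sample and fields are placeholders of mine, and the snippet is only an illustration, not a full parser.

```python
import re

# Sample string from the question (raw strings keep the backslash literal).
sample = (r'key1:"this is a test phrase"; key2:"this is another test phrase"; '
          r'key3:"ok this is a gotcha\; but you should get it";')

# A field is a run of non-semicolons or backslash-semicolon pairs,
# ending in a character that is neither ; nor a backslash.
BASIC = re.compile(r'\s*((?:[^;]|\\;)*[^;\\]);')

fields = BASIC.findall(sample)
for field in fields:
    print(field)
```

With this pattern the key3 field should come back whole, escaped semicolon included. Note that it treats every backslash before a ; as an escape, so \\; is not split either.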

What will make this tricky is if you want escaped backslashes in the input to be treated as literals. For example:

"You may want to split here\\;"
"But not here\;"

If that's something you want to take into account, try this (edited):

\s*((?:[^;\\]|\\.)+);
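
To see the difference between \\; and \; in practice, here is a small sketch that runs the pattern above over the two example phrases joined into one string. The input string and the names FIELD, text and fields are made up for illustration.

```python
import re

# The escape-aware pattern from above.
FIELD = re.compile(r'\s*((?:[^;\\]|\\.)+);')

# First phrase ends in an escaped backslash followed by a real separator;
# second phrase contains an escaped semicolon that must not split.
text = r'You may want to split here\\; But not here\; still the same field;'

fields = FIELD.findall(text)
for field in fields:
    print(field)
```

The first field should end at the ; that follows the doubled backslash, while the second field should run through the \; up to the final semicolon.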

Why so complicated? Because if escaped backslashes are allowed, then you have to account for things like this:

"0 slashes; 2 slashes\\; 5 slashes\\\\\; 6 slashes\\\\\\;"

Each pair of backslashes would be treated as a literal \. That means a ; is escaped only if an odd number of backslashes precedes it. So the above input would be grouped like this (values shown after unescaping):

#1: '0 slashes'
#2: '2 slashes\'
#3: '5 slashes\\; 6 slashes\\\'
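
The grouping above can be reproduced with a short sketch. The unescaping step (dropping the backslash from each escaped pair) is my assumption about how the values would be interpreted afterwards; the regex itself only finds the field boundaries. FIELD, data, raw_fields and values are illustrative names.

```python
import re

FIELD = re.compile(r'\s*((?:[^;\\]|\\.)+);')

# Raw string: 0, 2, 5 and 6 backslashes before the respective semicolons.
data = r'0 slashes; 2 slashes\\; 5 slashes\\\\\; 6 slashes\\\\\\;'

# Step 1: split on unescaped semicolons; fields still contain the escapes.
raw_fields = FIELD.findall(data)

# Step 2 (assumed post-processing): replace each backslash-plus-character
# pair with the character alone.
values = [re.sub(r'\\(.)', r'\1', field) for field in raw_fields]

for number, value in enumerate(values, start=1):
    print(f'#{number}: {value!r}')
```

Because repr doubles backslashes, the printed values show twice as many backslashes as the strings actually contain.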

Hence the different parts of the pattern:

\s*            #Whitespace
((?:
    [^;\\]     #One character that's not ; or \
  |            #Or...
    \\.        #A backslash followed by any character, even ; or another backslash
)+);           #Repeated one or more times, followed by ;

Consuming each backslash together with the character that follows it ensures that character is always treated as escaped, even if it's another backslash.
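
In Python the commented layout can be kept in the source by compiling with re.VERBOSE, which ignores whitespace and # comments outside character classes. This is a sketch under that assumption; FIELD_VERBOSE, sample and fields are names I picked. The comments inside the pattern deliberately avoid a literal backslash, since a backslash at the end of a comment line can confuse the verbose parser.

```python
import re

FIELD_VERBOSE = re.compile(r"""
    \s*            # leading whitespace, not captured
    ((?:
        [^;\\]     # one character that is neither a semicolon nor a backslash
      |            # or
        \\.        # a backslash plus any one character (an escaped pair)
    )+)            # one or more of the above, captured as the field
    ;              # the unescaped semicolon that ends the field
""", re.VERBOSE)

sample = (r'key1:"this is a test phrase"; key2:"this is another test phrase"; '
          r'key3:"ok this is a gotcha\; but you should get it";')

fields = FIELD_VERBOSE.findall(sample)
print(fields)
```

The verbose form should behave the same as the one-line pattern; it only changes how the pattern is written.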


4 Comments

Do we really need a negative lookbehind, or am I missing something?
@Abhijit - If you're not differentiating between \; and \\;, then you don't need one (hence my second pattern, which doesn't have one). If you want to treat \\; as a literal backslash and split on the ; as normal, then I don't see any other way of doing it.
@JohnDoe - So negative lookbehinds are allowed, but not if they're variable-length like my last pattern. In that case, I can't think of a regex that will detect escaped backslashes in the input.
@JohnDoe - I think I spoke too soon. Check the edited pattern.
1

If the quoted values may contain unescaped semicolons and escaped quotes (or any other escaped character), I would suggest matching each valid key:"value"; sequence instead of splitting. Like so:

import re
s = r'''
    key1:"this is a test phrase";
    key2:"this is another test phrase";
    key3:"ok this is a gotcha\; but you should get it";
    key4:"String with \" escaped quote";
    key5:"String with ; unescaped semi-colon";
    key6:"String with \\; escaped-escape before semi-colon";
    '''
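# Match key:"value"; where the value is any run of characters other than
# " and \, interleaved with backslash-escaped pairs.
result = re.findall(r'\w+:"[^"\\]*(?:\\.[^"\\]*)*";', s)
print(result)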

Note that this correctly handles any backslash escape within the double-quoted string, because each \\. consumes the backslash and the escaped character as a unit.
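
If the keys and values are needed separately, the same pattern can be given two capture groups. The following is a sketch only: the unescape helper, the PAIR name and the decision to strip backslashes from the values are my assumptions, not part of the pattern above.

```python
import re

text = r'''
    key3:"ok this is a gotcha\; but you should get it";
    key4:"String with \" escaped quote";
    key5:"String with ; unescaped semi-colon";
    key6:"String with \\; escaped-escape before semi-colon";
'''

# Same structure as the pattern above, with groups around key and value.
PAIR = re.compile(r'(\w+):"([^"\\]*(?:\\.[^"\\]*)*)";')

def unescape(value):
    """Hypothetical helper: turn each backslash-plus-character pair into the character."""
    return re.sub(r'\\(.)', r'\1', value)

parsed = {key: unescape(value) for key, value in PAIR.findall(text)}
for key, value in parsed.items():
    print(key, '=>', value)
```

A dict keeps only the last value for a repeated key; a list of (key, value) tuples would be the safer choice if duplicates can occur.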
