I'm beginning my adventure with regular expressions. I'm interested in splitting specially formatted strings. A letter that is not inside parentheses should become a separate element of the output list. Letters grouped inside parentheses should be kept together as one element.

Samples:

my string => wanted list

  • "ab(hpl)x" => ['a', 'b', 'hpl', 'x']
  • "(pck)(kx)(sd)" => ['pck', 'kx', 'sd']
  • "(kx)kxx(kd)" => ['kx', 'k', 'x', 'x', 'kd']
  • "fghk" => ['f', 'g', 'h', 'k']

How can it be achieved with regular expressions and re.split? Thanks in advance for your help.

2 Answers

This cannot be done with re.split (at least in Python versions before 3.7), as it would require splitting on zero-length matches.

From http://docs.python.org/library/re.html#re.split:

Note that split will never split a string on an empty pattern match.

Here is an alternative:

re.findall(r'(\w+(?=\))|\w)', your_string)

And an example:

>>> import re
>>> for s in ("ab(hpl)x", "(pck)(kx)(sd)", "(kx)kxx(kd)", "fghk"):
...     print(s, " => ", re.findall(r'(\w+(?=\))|\w)', s))
... 
ab(hpl)x  =>  ['a', 'b', 'hpl', 'x']
(pck)(kx)(sd)  =>  ['pck', 'kx', 'sd']
(kx)kxx(kd)  =>  ['kx', 'k', 'x', 'x', 'kd']
fghk  =>  ['f', 'g', 'h', 'k']
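
To make the two branches of that pattern easier to follow, here is a minimal sketch of the same regex written with re.VERBOSE. The helper name split_groups is hypothetical (it is not from the answer), and the outer capturing group is dropped because findall returns the whole match when there are no groups. Like the original, it assumes balanced parentheses.

```python
import re

# Same pattern as in the answer, spelled out with re.VERBOSE so each
# branch can carry a comment. Order matters: the group branch is tried first.
TOKEN = re.compile(r"""
    \w+(?=\))   # a run of word characters that ends right before ')'
    |           # otherwise
    \w          # a single word character
""", re.VERBOSE)


def split_groups(text):
    """Hypothetical helper: return single letters and parenthesised groups."""
    return TOKEN.findall(text)


for sample in ("ab(hpl)x", "(pck)(kx)(sd)", "(kx)kxx(kd)", "fghk"):
    print(sample, "=>", split_groups(sample))
```

The first branch only succeeds when the run of letters is immediately followed by a closing parenthesis, so letters outside a group fall through to the single-character branch.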

3 Comments

@Maciej Ziarko: Please note that this answer, by using '\w' and no lookbehind assertion, assumes that parentheses will always be balanced and that you never have digits or underscores in your data. Given your test data this is a fair assumption, so +1.
Yes, they will always be balanced. All other characters in my data are lowercase letters. I like both your answers and I voted them up. BTW: Can you recommend any good regular expression tutorial or book with nice examples?
I mostly used regular-expressions.info for learning, and rubular.com is very useful for quickly testing your regular expressions.

You want findall, not split. Use this pattern: r'(?<=\()[a-z]+(?=\))|[a-z]', which works for all your test cases.

>>> import re
>>> test_cases = ["ab(hpl)x", "(pck)(kx)(sd)", "(kx)kxx(kd)", "fghk"]
>>> pat = re.compile(r'(?<=\()[a-z]+(?=\))|[a-z]')
>>> for test_case in test_cases:
...     print("%-13s  =>  %s" % (test_case, pat.findall(test_case)))
...
ab(hpl)x       =>  ['a', 'b', 'hpl', 'x']
(pck)(kx)(sd)  =>  ['pck', 'kx', 'sd']
(kx)kxx(kd)    =>  ['kx', 'k', 'x', 'x', 'kd']
fghk           =>  ['f', 'g', 'h', 'k']

edit:

Replace [a-z] with \w if you want to match upper- and lowercase letters, digits, and underscores. You can remove the lookbehind assertion (?<=\() if your parentheses will never be unbalanced (e.g. "abc(def").
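
As a rough illustration of those two remarks, the sketch below compares the pattern with and without the lookbehind, and with \w in place of [a-z]. The inputs "ab)c" and "a1(b_2)" are made-up examples chosen for this sketch, not taken from the question. The lookbehind makes a visible difference when a closing parenthesis has no matching opening one.

```python
import re

with_lookbehind = re.compile(r'(?<=\()[a-z]+(?=\))|[a-z]')
without_lookbehind = re.compile(r'[a-z]+(?=\))|[a-z]')
word_chars = re.compile(r'(?<=\()\w+(?=\))|\w')

# Hypothetical input with a stray closing parenthesis.
stray = "ab)c"
strict = with_lookbehind.findall(stray)      # letters stay separate
loose = without_lookbehind.findall(stray)    # "ab" is treated as a group
print(strict, loose)

# Hypothetical input containing digits and an underscore.
mixed = "a1(b_2)"
print(with_lookbehind.findall(mixed))  # [a-z] skips '1', '_' and '2'
print(word_chars.findall(mixed))       # \w keeps them
```

With balanced input such as the four test cases above, the variants with and without the lookbehind return the same lists.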
