You want to tokenize your input first. For balanced expressions like yours, the simplest tokenizer is a regular expression that splits on ( and ) and ignores surrounding whitespace:
import re
_tokenizer = re.compile(r'\s*([()])\s*').split
def tokenize(s):
    return filter(None, _tokenizer(s))
and use tokenize() instead of iter():
def parse_conditions(expr):
    def _helper(tokens):
        items = []
        for item in tokens:
            if item == '(':
                result, closeparen = _helper(tokens)
                if not closeparen:
                    raise ValueError("Unbalanced parentheses")
                items.append(result)
            elif item == ')':
                return items, True
            else:
                items.append(item)
        return items, False
    return _helper(tokenize(expr))[0]
The filter(None, ...) call filters out the empty strings that re.split() produces at points where the input starts or ends with a ( or ), or if two ( or ) characters directly follow one another.
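On the sample input, the tokenizer then produces a clean token stream (filter() returns an iterator in Python 3, so wrap it in list() to inspect it):

>>> list(tokenize('v1 and (v2 and (v3 or v4))'))
['v1 and', '(', 'v2 and', '(', 'v3 or v4', ')', ')']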
Demo:
>>> s = 'v1 and (v2 and (v3 or v4))'
>>> parse_conditions(s)
['v1 and', ['v2 and', ['v3 or v4']]]
To split out operators too, you can either add valid operators to the splitting expression, or just add whitespace as a delimiter.
Splitting on whitespace, where we don't include the spaces in the tokens:
_tokenizer = re.compile(r'(?:([()])|\s+)').split
produces:
>>> parse_conditions(s)
['v1', 'and', ['v2', 'and', ['v3', 'or', 'v4']]]
while a tokenizer that focuses on valid operators would be:
_tokenizer = re.compile(r'\s*([()]|\b(?:or|and)\b)\s*').split
and for your sample input that produces the same result.
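At the token level, both of these variants produce the same stream for your sample input:

>>> list(tokenize(s))
['v1', 'and', '(', 'v2', 'and', '(', 'v3', 'or', 'v4', ')', ')']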
Note that your code has a bug; it won't detect an unbalanced closing parenthesis:
>>> parse_conditions('foo) and bar')
['foo']
You need to validate that the first _helper() call returns False for the second element in the returned tuple. Instead of return _helper(tokenize(expr))[0], use:
items, closing = _helper(tokenize(expr))
if closing:  # indicating there was a closing ) without an opening (
    raise ValueError("Unbalanced parentheses")
return items
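With that check in place, the earlier problematic input is rejected:

>>> parse_conditions('foo) and bar')
Traceback (most recent call last):
  ...
ValueError: Unbalanced parentheses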
Finally, I'd avoid recursion here and use an explicit stack in place of the call stack that recursion relies on. Your own stack is limited only by available memory, whereas the recursion stack is capped at a fixed depth (1000 frames by default):
def parse_conditions(expr):
    stack = []  # or a `collections.deque()` object, which is a little faster
    top = items = []
    for token in tokenize(expr):
        if token == '(':
            stack.append(items)
            items.append([])
            items = items[-1]
        elif token == ')':
            if not stack:
                raise ValueError("Unbalanced parentheses")
            items = stack.pop()
        else:
            items.append(token)
    if stack:
        raise ValueError("Unbalanced parentheses")
    return top
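This version behaves the same for well-formed input and also catches a missing closing parenthesis; using the whitespace-splitting tokenizer from above:

>>> parse_conditions('v1 and (v2 and (v3 or v4))')
['v1', 'and', ['v2', 'and', ['v3', 'or', 'v4']]]
>>> parse_conditions('(v1 and v2')
Traceback (most recent call last):
  ...
ValueError: Unbalanced parentheses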
You might be interested in looking at the tokenize module, which implements a tokenizer for Python code; its source uses a series of regular expressions to split Python source code into tokens, where each token carries not only the token text but also a token type, the start and end positions as (line, column) tuples, and the full line it came from.
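Just as a sketch of that standard-library API (not something you need for this problem): your sample expression happens to be valid Python, so the tokenize module can split it too. It is imported under a different name here so it doesn't clash with the tokenize() helper defined above:

import io
import tokenize as py_tokenize

s = 'v1 and (v2 and (v3 or v4))'
# generate_tokens() takes a readline callable and yields TokenInfo
# namedtuples: (type, string, start, end, line)
for tok in py_tokenize.generate_tokens(io.StringIO(s).readline):
    print(py_tokenize.tok_name[tok.type], repr(tok.string))

which prints NAME and OP tokens for the names, operators and parentheses (plus a trailing NEWLINE and ENDMARKER).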
["v1", "and", "(", "v2", "and", "(", "v3", "or", "v4", ")", ")"]. Next, parse the list of tokens instead of the list of characters.sqlparsebut was not able to find out how to parse expressions similar to the one I've posted and maintain expressions' precedence. If you could provide an example I would appreciate it.