You want to tokenize your input first. For balanced expressions like yours, the simplest tokenizer is a regular expression that splits on ( and ) and ignores surrounding whitespace:
import re
_tokenizer = re.compile(r'\s*([()])\s*').split
def tokenize(s):
    return filter(None, _tokenizer(s))
and use tokenize() instead of iter():
def parse_conditions(expr):
    def _helper(tokens):
        items = []
        for item in tokens:
            if item == '(':
                result, closeparen = _helper(tokens)
                if not closeparen:
                    raise ValueError("Unbalanced parentheses")
                items.append(result)
            elif item == ')':
                return items, True
            else:
                items.append(item)
        return items, False
    return _helper(tokenize(expr))[0]
The filter(None, ...) call filters out the empty strings that re.split() produces at points where the input starts or ends with a ( or ), or if two ( or ) characters directly follow one another.
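On the sample input, the tokenizer then produces a clean token stream (filter() returns an iterator in Python 3, so wrap it in list() to inspect it):

>>> list(tokenize('v1 and (v2 and (v3 or v4))'))
['v1 and', '(', 'v2 and', '(', 'v3 or v4', ')', ')']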
Demo:
>>> s = 'v1 and (v2 and (v3 or v4))'
>>> parse_conditions(s)
['v1 and', ['v2 and', ['v3 or v4']]]
To split out operators too, you can either add valid operators to the splitting expression, or just add whitespace as a delimiter.
Splitting on whitespace, where we don't include the spaces in the tokens:
_tokenizer = re.compile(r'(?:([()])|\s+)').split
produces:
>>> parse_conditions(s)
['v1', 'and', ['v2', 'and', ['v3', 'or', 'v4']]]
while a tokenizer that focuses on valid operators would be:
_tokenizer = re.compile(r'\s*([()]|\b(?:or|and)\b)\s*').split
and for your sample input that produces the same result.
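At the token level, both of these variants produce the same stream for your sample input:

>>> list(tokenize(s))
['v1', 'and', '(', 'v2', 'and', '(', 'v3', 'or', 'v4', ')', ')']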
Note that your code has a bug; it won't detect an unbalanced closing parenthesis:
>>> parse_conditions('foo) and bar')
['foo']
You need to validate that the first _helper() call returns False for the second element in the returned tuple. Instead of return _helper(tokenize(expr))[0], use:
items, closing = _helper(tokenize(expr))
if closing:  # indicating there was a closing ) without an opening (
    raise ValueError("Unbalanced parentheses")
return items
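With that check in place, the earlier problematic input is rejected:

>>> parse_conditions('foo) and bar')
Traceback (most recent call last):
  ...
ValueError: Unbalanced parentheses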
Finally, I'd avoid recursion here and use an explicit stack in place of the call stack that recursion relies on. Your own stack is limited only by available memory, whereas the recursion stack is capped at a fixed depth (1000 frames by default):
def parse_conditions(expr):
    stack = []  # or a `collections.deque()` object, which is a little faster
    top = items = []
    for token in tokenize(expr):
        if token == '(':
            stack.append(items)
            items.append([])
            items = items[-1]
        elif token == ')':
            if not stack:
                raise ValueError("Unbalanced parentheses")
            items = stack.pop()
        else:
            items.append(token)
    if stack:
        raise ValueError("Unbalanced parentheses")
    return top
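This version behaves the same for well-formed input and also catches a missing closing parenthesis; using the whitespace-splitting tokenizer from above:

>>> parse_conditions('v1 and (v2 and (v3 or v4))')
['v1', 'and', ['v2', 'and', ['v3', 'or', 'v4']]]
>>> parse_conditions('(v1 and v2')
Traceback (most recent call last):
  ...
ValueError: Unbalanced parentheses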
You might be interested in looking at the tokenize module, which implements a tokenizer for Python code; its source uses a series of regular expressions to split Python source code into tokens, where each token carries not only the token text but also a token type, the start and end positions as (line, column) tuples, and the full line it came from.
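Just as a sketch of that standard-library API (not something you need for this problem): your sample expression happens to be valid Python, so the tokenize module can split it too. It is imported under a different name here so it doesn't clash with the tokenize() helper defined above:

import io
import tokenize as py_tokenize

s = 'v1 and (v2 and (v3 or v4))'
# generate_tokens() takes a readline callable and yields TokenInfo
# namedtuples: (type, string, start, end, line)
for tok in py_tokenize.generate_tokens(io.StringIO(s).readline):
    print(py_tokenize.tok_name[tok.type], repr(tok.string))

which prints NAME and OP tokens for the names, operators and parentheses (plus a trailing NEWLINE and ENDMARKER).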
["v1", "and", "(", "v2", "and", "(", "v3", "or", "v4", ")", ")"]. Next, parse the list of tokens instead of the list of characters.sqlparsebut was not able to find out how to parse expressions similar to the one I've posted and maintain expressions' precedence. If you could provide an example I would appreciate it.