2

Assuming that I have an expression similar to the one below (in reality this is an SQL statement):

"v1 and (v2 and (v3 or v4))"

I want to parse it in order to process strings and maintain parentheses' precedence. To do so, I've used the following recursive function

def parse_conditions(expr):
    def _helper(iter):
        items = []
        for item in iter:
            if item == '(':
                result, closeparen = _helper(iter)
                if not closeparen:
                    raise ValueError("Unbalanced parentheses")
                items.append(result)
            elif item == ')':
                return items, True
            else:
                items.append(item)
        return items, False
    return _helper(iter(expr))[0] 

that gives the following output:

print(parse_conditions("v1 and (v2 and (v3 or v4))"))

['v', '1', ' ', 'a', 'n', 'd', ' ', ['v', '2', ' ', 'a', 'n', 'd', ' ', ['v', '3', ' ', 'o', 'r', ' ', 'v', '4']]]

The expected output however, would either be

['v1 and', ['v2 and', ['v3 or v4']]]

or

['v1', and', ['v2', and', ['v3', 'or', 'v4']]]

Any thoughts how to achieve this?

4
  • 1
    It's a two-step process. First, tokenize the input string to produce ["v1", "and", "(", "v2", "and", "(", "v3", "or", "v4", ")", ")"]. Next, parse the list of tokens instead of the list of characters. Commented Mar 2, 2019 at 15:08
  • If you're parsing SQL, and unless you really want to write a parser, you should take a look at sqlparse (github.com/andialbrecht/sqlparse), even if you just end up using their lexer/tokenizer (github.com/andialbrecht/sqlparse/blob/master/sqlparse/lexer.py) Commented Mar 2, 2019 at 15:53
  • @thebjorn I've had a look at sqlparse but was not able to find out how to parse expressions similar to the one I've posted and maintain expressions' precedence. If you could provide an example I would appreciate it. Commented Mar 2, 2019 at 15:56
  • Then you're probably better off following Martijn's code.. Commented Mar 2, 2019 at 15:59

1 Answer 1

5

You want to tokenize your input. The simplest tokenizer needed to parse your balanced expressions could be a simple regular expression here, splitting on ( and ), ignoring whitespace:

import re

_tokenizer = re.compile(r'\s*([()])\s*').split
def tokenize(s):
    return filter(None, _tokenizer(s))

and use tokenize()) instead of iter():

def parse_conditions(expr):
    def _helper(tokens):
        items = []
        for item in tokens:
            if item == '(':
                result, closeparen = _helper(tokens)
                if not closeparen:
                    raise ValueError("Unbalanced parentheses")
                items.append(result)
            elif item == ')':
                return items, True
            else:
                items.append(item)
        return items, False
    return _helper(tokenize(expr))[0] 

The filter(None, ...) call filters out the empty strings that re.split() produces at points where the input starts or ends with a ( or ), or if two ( or ) characters directly follow one another.

Demo:

>>> s = 'v1 and (v2 and (v3 or v4))'
>>> parse_conditions(s)
['v1 and', ['v2 and', ['v3 or v4']]]

To split out operators too, you can either add valid operators to the splitting expression, or just add whitespace as a delimiter.

Splitting on whitespace, where we don't include the spaces in the tokens:

_tokenizer = re.compile(r'(?:([()])|\s+)').split

produces:

>>> parse_conditions(s)
['v1', 'and', ['v2', 'and', ['v3', 'or', 'v4']]]

while focusing on valid operators would be:

_tokenizer = re.compile(r'\s*([()]|\b(?:or|and)\b)\s*').split

and for your sample input that produces the same result.

Note that your code has a bug; it won't detect an unbalanced closing parenthesis:

>>> parse_conditions('foo) and bar')
['foo']

You need to validate that the first _helper() call returns False for the second element in the returned tuple. Instead of return _helper(tokenize(expr))[0], use:

items, closing = _helper(tokenize(expr))
if closing:  # indicating there was a closing ) without opening (
    raise ValueError("Unbalanced parentheses")
return items

Finally, I'd not use recursion here, and instead use an explicit stack to replace the call stack on which recursion builds. Your own stack is only limited by memory, where a recursion stack is limited to a fixed size (1000 by default):

def parse_conditions(expr):
    stack = []  # or a `collections.deque()` object, which is a little faster
    top = items = []
    for token in tokenize(expr):
        if token == '(':
            stack.append(items)
            items.append([])
            items = items[-1]
        elif token == ')':
            if not stack:
                raise ValueError("Unbalanced parentheses")
            items = stack.pop()  
        else:
            items.append(token)
    if stack:
        raise ValueError("Unbalanced parentheses")
    return top

You might be interested in looking at the tokenize module, which implements a tokenizer for Python code; the source code uses a series of regular expressions to split out Python source code into tokens (where a token not only contains the token text, but also a token type, the start and end positions (a column, line tuple) and the full line it came from).

Sign up to request clarification or add additional context in comments.

6 Comments

Not if you add your operators to the split group; e.g. [()<>]. Tokenisation just means splitting out an opaque string into interesting components, and if your components are not always space delimited then expand the code regex to handle single character and muti-character operators, keywords and other components specifically.
I note that there was no mention of a more complex grammar in your question however, if your real problem has additional complexities, can I suggest you ask a new question?
Would you mind if I edit this question and include some more complex examples? I just wanted to include a minimal, simple example - my bad.
Kind-of, yes. That’s called “shifting the goal posts”, moving from problem to new problem.
Here's the new question. I would appreciate any help.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.