Split "Nested" String By Parentheses into Nested List

Question

I have a string representation of a tree. I'd like to convert it to a nested list. Is there a way to do this recursively, so that I end up with nested lists?

An example string looks like:

(TOP (S (NP (PRP I)) (VP (VBP need) (NP (NP (DT a) (NN flight)) (PP
(IN from) (NP (NNP Atlanta))) (PP (TO to) (NP (NP (NNP Charlotte)) (NP
(NNP North) (NNP Carolina)))) (NP (JJ next) (NNP Monday))))))

So far I have the code below, but it doesn't give me anything close to what I'm looking for.

import sys
import re

for tree_str in sys.stdin:
    print([", ".join(x.split()) for x in re.split(r'[()]', tree_str) if x.strip()])

Could you give an example of what you want as output? Do the spaces in the string matter?
– Jivan, Commented Dec 22, 2014 at 23:59
Spaces don't matter. This is from the Penn Treebank, which can, sometimes, be nice.
– Adam_G, Commented Dec 23, 2014 at 0:49

zord · Accepted Answer · 2014-12-23 00:53:35Z

3

My approach would be something like this:

import re


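def make_tree(data):
    # Tokenize into "(", ")" and words; whitespace is discarded.
    items = re.findall(r"\(|\)|\w+", data)

    def req(index):
        # Parse one list starting just after its "(" and return it
        # together with the index of the matching ")".
        result = []
        item = items[index]
        while item != ")":
            if item == "(":
                subtree, index = req(index + 1)
                result.append(subtree)
            else:
                result.append(item)
            index += 1
            item = items[index]
        return result, index

    # Skip the outermost "(" at index 0.
    return req(1)[0]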


string = "(TOP (S (NP (PRP I))..." # omitted for readability
tree = make_tree(string)

print(tree)
# Output: ['TOP', ['S', ['NP', ['PRP', 'I']]...
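
One caveat with the `\w+` pattern: Penn Treebank data also contains tags and tokens with non-word characters (for example `PRP$` or a bare `.`), which `\w+` would truncate or drop. A minimal sketch comparing the two tokenizers, assuming that any run of characters other than whitespace and parentheses should count as one token (the sample string and the names `strict` and `permissive` are made up for illustration):

```python
import re

sample = "(NP (PRP$ my) (NN flight) (. .))"

# Pattern from the answer: "$" and "." are not word characters,
# so "PRP$" is cut to "PRP" and the "." tokens disappear.
strict = re.findall(r"\(|\)|\w+", sample)

# Permissive variant: a token is anything that is neither
# whitespace nor a parenthesis.
permissive = re.findall(r"\(|\)|[^\s()]+", sample)

print(strict)
print(permissive)
```

Swapping the permissive pattern into `make_tree` should leave the rest of the function unchanged, since it only compares tokens against `"("` and `")"`.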

answered Dec 23, 2014 at 0:53

zord


Jivan · Accepted Answer · 2014-12-23 00:53:18Z

1

A bit hacky but kinda does the trick anyway :) You definitely have your nested lists.

import re
import ast

input = "(TOP (S (NP (PRP I)) (VP (VBP need) (NP (NP (DT a) (NN flight)) (PP (IN from) (NP (NNP Atlanta))) (PP (TO to) (NP (NP (NNP Charlotte)) (NP (NNP North) (NNP Carolina)))) (NP (JJ next) (NNP Monday))))))"

# replaces all brackets by square brackets
# and adds commas when needed
input = input.replace("(", "[")\
             .replace(")", "]")\
             .replace("] [", "], [")

# places all the words between double quotes
# and appends a comma after each
input = re.sub(r'(\w+)', r'"\1",', input)

# safely evaluates the resulting string
output = ast.literal_eval(input)

print(output)
print(type(output))

# ['TOP', ['S', ['NP', ['PRP', 'I']], ['VP', ['VBP', 'need'], ['NP', ['NP', ['DT', 'a'], ['NN', 'flight']], ['PP', ['IN', 'from'], ['NP', ['NNP', 'Atlanta']]], ['PP', ['TO', 'to'], ['NP', ['NP', ['NNP', 'Charlotte']], ['NP', ['NNP', 'North'], ['NNP', 'Carolina']]]], ['NP', ['JJ', 'next'], ['NNP', 'Monday']]]]]]
# <class 'list'>

Note: ast.literal_eval() only evaluates Python literals (strings, numbers, lists, tuples, dicts and the like) and raises an error for anything else, such as function calls or variable names, so you can use it without having to check for malicious code first.
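
To illustrate that behaviour, here is a small sketch (the input strings are made up for the demonstration): a nested-list literal is accepted, while a string containing a function call is rejected without being executed.

```python
import ast

# A plain nested-list literal is evaluated normally.
tree = ast.literal_eval('["TOP", ["S", ["NP", ["PRP", "I"]]]]')
print(tree)

# A function call is not a literal, so it is rejected, not run.
rejected = False
try:
    ast.literal_eval('__import__("os").getcwd()')
except ValueError as exc:
    rejected = True
    print("rejected:", exc)
```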

edited Dec 23, 2014 at 0:53

answered Dec 23, 2014 at 0:08

Jivan


2 Comments

Adam Smith Over a year ago

Beautiful kludge! Normally I'd strongly discourage using str.replace and eval to do something like this, but in this case it really does seem like the fastest approach.

Adam Smith Over a year ago

Note I would probably dispense with making it a list and just do input.replace(") (", "), ("), re.sub(r"(\w+)", r'"\1",', input) and output = ast.literal_eval(input) but YMMV
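
A sketch of how the variant in that comment might look when written out, assuming the three steps are applied in sequence; it yields nested tuples rather than nested lists. The shortened input string and the names `text` and `result` are chosen here for illustration.

```python
import ast
import re

text = "(TOP (S (NP (PRP I)) (VP (VBP need))))"

# Separate sibling groups with commas, keeping the round brackets.
text = text.replace(") (", "), (")

# Quote every word and append a comma after it; the trailing comma
# also makes one-element groups valid tuples.
text = re.sub(r"(\w+)", r'"\1",', text)

result = ast.literal_eval(text)
print(result)
# ('TOP', ('S', ('NP', ('PRP', 'I')), ('VP', ('VBP', 'need'))))
```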

pillmuncher · Accepted Answer · 2014-12-23 01:15:31Z

Writing a simple parser for S-Expressions is not that hard:

import pprint
import re

pattern = r'''
    (?P<open_paren> \( ) |
    (?P<close_paren> \) ) |
    (?P<word> \w+) |
    (?P<whitespace> \s+) |
    (?P<eof> $) |
    (?P<error> \S)
'''

scan = re.compile(pattern=pattern, flags=re.VERBOSE).finditer

text = '''
(TOP (S (NP (PRP I)) (VP (VBP need) (NP (NP (DT a) (NN flight))
 (PP (IN from) (NP (NNP Atlanta))) (PP (TO to) (NP (NP (NNP Charlotte))
 (NP (NNP North) (NNP Carolina)))) (NP (JJ next) (NNP Monday))))))
'''

ERR_MSG = 'input string kaputt!!'

stack = [[]]

for match in scan(text):
    token_type = match.lastgroup
    token = match.group(0)
    if token_type == 'open_paren':
        stack.append([])
    elif token_type == 'close_paren':
        if len(stack) < 2:
            # a ')' without a matching '('
            raise Exception(ERR_MSG)
        top = stack.pop()
        stack[-1].append(top)
    elif token_type == 'word':
        stack[-1].append(token)
    elif token_type == 'whitespace':
        pass
    elif token_type == 'eof':
        break
    else:
        raise Exception(ERR_MSG)

if 1 == len(stack) == len(stack[0]):
    pprint.pprint(stack[0][0])
else:
    raise Exception(ERR_MSG)

Result:

['TOP',
 ['S',
  ['NP', ['PRP', 'I']],
  ['VP',
   ['VBP', 'need'],
   ['NP',
    ['NP', ['DT', 'a'], ['NN', 'flight']],
    ['PP', ['IN', 'from'], ['NP', ['NNP', 'Atlanta']]],
    ['PP',
     ['TO', 'to'],
     ['NP',
      ['NP', ['NNP', 'Charlotte']],
      ['NP', ['NNP', 'North'], ['NNP', 'Carolina']]]],
    ['NP', ['JJ', 'next'], ['NNP', 'Monday']]]]]]

Svante · Accepted Answer · 2014-12-23 00:57:25Z

-1

This is called "parsing". One parser generator for Python seems to be Yapps. Yapps' documentation even shows how to write a Lisp parser, of which your application seems to be just a subset.

The subset that you need seems to be:

parser Sublisp:
    ignore:      '\\s+'
    token ID:    '[-+*/!@%^&=.a-zA-Z0-9_]+' 

    rule expr:   ID     {{ return ('id', ID) }}
               | list   {{ return list }}
    rule list: "\\("    {{ result = [] }} 
               ( expr   {{ result.append(expr) }}
               )*  
               "\\)"    {{ return result }}

After compiling, this will parse your string into a tree whose leaves are tuples of the form ('id', 'FOO'). To get the tree in the form you want, you can either modify the generated Python code (it is quite readable) or transform the tree afterwards.
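
Transforming the tree afterwards could look roughly like the sketch below. It assumes the parser returns Python lists whose leaves are ('id', name) tuples, as the grammar above suggests; `strip_ids` is a hypothetical helper name, and `parsed` is a hand-written stand-in for the parser's output, not something produced by Yapps here.

```python
def strip_ids(node):
    # Leaves are ('id', name) tuples; inner nodes are Python lists.
    if isinstance(node, tuple):
        return node[1]
    return [strip_ids(child) for child in node]


# Hand-written stand-in for what the generated parser would return.
parsed = [("id", "NP"), [("id", "PRP"), ("id", "I")]]

print(strip_ids(parsed))
# ['NP', ['PRP', 'I']]
```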

answered Dec 23, 2014 at 0:57

Svante


1 Comment

Svante Over a year ago

@Adam_G: Yapps is a parser generator, written in Python, that produces Python code from a grammar definition. The above is such a grammar definition.
