2

I have a string representation of a tree. I'd like to convert it to a nested list. Is there a way to do this recursively, so that I end up with nested lists?

An example string looks like:

(TOP (S (NP (PRP I)) (VP (VBP need) (NP (NP (DT a) (NN flight)) (PP
(IN from) (NP (NNP Atlanta))) (PP (TO to) (NP (NP (NNP Charlotte)) (NP
(NNP North) (NNP Carolina)))) (NP (JJ next) (NNP Monday))))))

So far I have the code below, but it does not give me anything like what I'm looking for.

import sys
import re

for tree_str in sys.stdin:
    print([", ".join(x.split()) for x in re.split(r'[()]', tree_str) if x.strip()])

  • Could you give an example of what you want as output? Do the spaces in the string matter? Commented Dec 22, 2014 at 23:59
  • That's an interestingly human-readable tree serialization. Commented Dec 23, 2014 at 0:41
  • Spaces don't matter. This is from the Penn Treebank, which can, sometimes, be nice. Commented Dec 23, 2014 at 0:49

4 Answers

3

My approach would be something like this:

import re


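def make_tree(data):
    # Tokenize the string into "(", ")" and words.
    items = re.findall(r"\(|\)|\w+", data)

    def req(index):
        # Parse the contents of one list, starting just after its "(".
        # Returns the list and the index of its closing ")".
        result = []
        item = items[index]
        while item != ")":
            if item == "(":
                subtree, index = req(index + 1)
                result.append(subtree)
            else:
                result.append(item)
            index += 1
            item = items[index]
        return result, index

    # Start at index 1 to skip the outermost "(".
    return req(1)[0]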


string = "(TOP (S (NP (PRP I))..." # omitted for readability
tree = make_tree(string)

print(tree)
# Output: ['TOP', ['S', ['NP', ['PRP', 'I']]...
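
One caveat with the `\w+` token pattern: Penn Treebank data also contains labels and tokens that are not purely word characters, for example `-NONE-` or the pair `(. .)`, and `\w+` would split or drop those. A minimal sketch of a more permissive tokenizer, assuming that any run of characters other than whitespace and parentheses should count as one token (the name `tokenize` is made up for illustration; it could stand in for the `re.findall` call above):

```python
import re


def tokenize(data):
    # A parenthesis is always its own token; everything else runs
    # until the next whitespace character or parenthesis.
    return re.findall(r"[()]|[^\s()]+", data)


print(tokenize("(S (NP (-NONE- *)) (. .))"))
# ['(', 'S', '(', 'NP', '(', '-NONE-', '*', ')', ')', '(', '.', '.', ')', ')']
```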

1

A bit hacky but kinda does the trick anyway :) You definitely have your nested lists.

import re
import ast

input = "(TOP (S (NP (PRP I)) (VP (VBP need) (NP (NP (DT a) (NN flight)) (PP (IN from) (NP (NNP Atlanta))) (PP (TO to) (NP (NP (NNP Charlotte)) (NP (NNP North) (NNP Carolina)))) (NP (JJ next) (NNP Monday))))))"

# replaces all brackets by square brackets
# and adds commas when needed
input = input.replace("(", "[")\
             .replace(")", "]")\
             .replace("] [", "], [")

# places all the words between double quotes
# and appends a comma after each
input = re.sub(r'(\w+)', r'"\1",', input)

# safely evaluates the resulting string
output = ast.literal_eval(input)

print(output)
print(type(output))

# ['TOP', ['S', ['NP', ['PRP', 'I']], ['VP', ['VBP', 'need'], ['NP', ['NP', ['DT', 'a'], ['NN', 'flight']], ['PP', ['IN', 'from'], ['NP', ['NNP', 'Atlanta']]], ['PP', ['TO', 'to'], ['NP', ['NP', ['NNP', 'Charlotte']], ['NP', ['NNP', 'North'], ['NNP', 'Carolina']]]], ['NP', ['JJ', 'next'], ['NNP', 'Monday']]]]]]
# <class 'list'>

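Note: ast.literal_eval() only accepts Python literal structures (strings, numbers, tuples, lists, dicts, sets, booleans and None) and raises an error for anything else, such as function calls or other logic, so you can use it without having to check for malicious code first. Extremely large or deeply nested input can still exhaust memory or the interpreter's stack, so it is not a complete safeguard for untrusted data.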
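
To make the safety note concrete, here is a small sketch (the sample strings are made up for illustration): a list literal is parsed, while a string containing a function call is rejected with a ValueError instead of being executed.

```python
import ast

# A plain list literal, trailing commas included, is accepted.
print(ast.literal_eval('["NP", ["PRP", "I",],]'))
# ['NP', ['PRP', 'I']]

# A function call is not a literal, so it is rejected rather than run.
try:
    ast.literal_eval('__import__("os").getcwd()')
except ValueError as exc:
    print("rejected:", type(exc).__name__)
# rejected: ValueError
```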

2 Comments

Beautiful kludge! Normally I'd strongly discourage using str.replace and eval to do something like this, but in this case it really does seem like the fastest approach.
Note: I would probably dispense with making it a list and just do input = input.replace(") (", "), ("), then input = re.sub(r"(\w+)", r'"\1",', input) and output = ast.literal_eval(input), which gives nested tuples instead, but YMMV.
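
A sketch of the variant suggested in the comment above, on a shortened sample string (the variable names `text` and `tree` are chosen here for illustration). Both `str.replace` and `re.sub` return new strings, so the results have to be assigned back. Because the parentheses are kept, the result is nested tuples rather than nested lists.

```python
import ast
import re

text = "(S (NP (PRP I)) (VP (VBP need)))"

# Keep the parentheses, add commas between sibling subtrees,
# then quote every word and follow it with a comma.
text = text.replace(") (", "), (")
text = re.sub(r"(\w+)", r'"\1",', text)

tree = ast.literal_eval(text)
print(tree)
# ('S', ('NP', ('PRP', 'I')), ('VP', ('VBP', 'need')))
```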
0

Writing a simple parser for S-expressions is not that hard:

import pprint
import re

pattern = r'''
    (?P<open_paren> \( ) |
    (?P<close_paren> \) ) |
    (?P<word> \w+) |
    (?P<whitespace> \s+) |
    (?P<eof> $) |
    (?P<error> \S)
'''

scan = re.compile(pattern=pattern, flags=re.VERBOSE).finditer

text = '''
(TOP (S (NP (PRP I)) (VP (VBP need) (NP (NP (DT a) (NN flight))
 (PP (IN from) (NP (NNP Atlanta))) (PP (TO to) (NP (NP (NNP Charlotte))
 (NP (NNP North) (NNP Carolina)))) (NP (JJ next) (NNP Monday))))))
'''

ERR_MSG = 'input string kaputt!!'

stack = [[]]

for match in scan(text):
    token_type = match.lastgroup
    token = match.group(0)
    if token_type == 'open_paren':
        stack.append([])
    elif token_type == 'close_paren':
        top = stack.pop()
        stack[-1].append(top)
    elif token_type == 'word':
        stack[-1].append(token)
    elif token_type == 'whitespace':
        pass
    elif token_type == 'eof':
        break
    else:
        raise Exception(ERR_MSG)

if 1 == len(stack) == len(stack[0]):
    pprint.pprint(stack[0][0])
else:
    raise Exception(ERR_MSG)

Result:

['TOP',
 ['S',
  ['NP', ['PRP', 'I']],
  ['VP',
   ['VBP', 'need'],
   ['NP',
    ['NP', ['DT', 'a'], ['NN', 'flight']],
    ['PP', ['IN', 'from'], ['NP', ['NNP', 'Atlanta']]],
    ['PP',
     ['TO', 'to'],
     ['NP',
      ['NP', ['NNP', 'Charlotte']],
      ['NP', ['NNP', 'North'], ['NNP', 'Carolina']]]],
    ['NP', ['JJ', 'next'], ['NNP', 'Monday']]]]]]
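
The same stack idea can be wrapped in a reusable function. The following is only a sketch under a few assumptions that differ from the code above: the name `parse_sexpr` is made up, every run of non-whitespace, non-parenthesis characters is treated as one token (so there is no separate error token), and a stray closing parenthesis raises a ValueError instead of failing on an empty stack.

```python
import re

TOKEN = re.compile(r"[()]|[^\s()]+")


def parse_sexpr(text):
    # stack[0] collects the top-level trees; each "(" opens a new list.
    stack = [[]]
    for token in TOKEN.findall(text):
        if token == "(":
            stack.append([])
        elif token == ")":
            if len(stack) == 1:
                raise ValueError("unexpected ')'")
            top = stack.pop()
            stack[-1].append(top)
        else:
            stack[-1].append(token)
    # Exactly one finished tree must remain at the top level.
    if len(stack) != 1 or len(stack[0]) != 1:
        raise ValueError("expected exactly one balanced tree")
    return stack[0][0]


print(parse_sexpr("(NP (DT a) (NN flight))"))
# ['NP', ['DT', 'a'], ['NN', 'flight']]
```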

-1

This is called "parsing". One parser generator for Python seems to be Yapps. Yapps' documentation even shows how to write a Lisp parser, of which your application seems to be just a subset.

The subset that you need seems to be:

parser Sublisp:
    ignore:      '\\s+'
    token ID:    '[-+*/!@%^&=.a-zA-Z0-9_]+' 

    rule expr:   ID     {{ return ('id', ID) }}
               | list   {{ return list }}
    rule list: "\\("    {{ result = [] }} 
               ( expr   {{ result.append(expr) }}
               )*  
               "\\)"    {{ return result }}

After compiling, this will parse your string into nested lists whose leaves are tuples of the form ('id', 'FOO'). To get the tree in the form you desire, you can either modify the generated Python code (it is quite readable) or transform the tree afterwards.
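
Transforming the tree afterwards could look roughly like this. It is a sketch only: the helper name `strip_ids` is made up, and the `parsed` value is written by hand to mimic what the grammar's actions return (lists for parenthesized groups, ('id', name) tuples for leaves) rather than being actual Yapps output.

```python
def strip_ids(node):
    # A leaf is the tuple ('id', name); keep only the name.
    if isinstance(node, tuple):
        return node[1]
    # Anything else is a list of child nodes.
    return [strip_ids(child) for child in node]


# Hand-written stand-in for the parser result of "(NP (DT a) (NN flight))".
parsed = [("id", "NP"), [("id", "DT"), ("id", "a")], [("id", "NN"), ("id", "flight")]]

print(strip_ids(parsed))
# ['NP', ['DT', 'a'], ['NN', 'flight']]
```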

1 Comment

@Adam_G: Yapps is a parser generator, written in Python, that produces Python code from a grammar definition. The above is such a grammar definition.
