Parsing nested lists and returning original strings for every valid list

Question

Suppose I have a string s = '{aaaa{bc}xx{d{e}}f}' with a nested-list structure. I would like a hierarchical representation of it that also gives access to the substrings corresponding to the valid sub-lists. For simplicity, let's forget about the hierarchy: I just want a list of the substrings corresponding to valid sub-lists, something like:

['{aaaa{bc}xx{d{e}}f}', '{bc}', '{d{e}}', '{e}']

Using nestedExpr, one can obtain the nested structure, which includes all valid sub-lists:

import pyparsing as pp

s = '{aaaa{bc}xx{d{e}}f}'
not_braces = pp.CharsNotIn('{}')
expr = pp.nestedExpr('{', '}', content=not_braces)
res = expr('L0 Contents').parseString(s)
print(res.dump())

prints:

[['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']]
- L0 Contents: [['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']]
  [0]:
    ['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']
    [0]:
      aaaa
    [1]:
      ['bc']
    [2]:
      xx
    [3]:
      ['d', ['e']]
      [0]:
        d
      [1]:
        ['e']
    [4]:
      f

To obtain the original string representation for a parsed element, I have to wrap it in pyparsing.originalTextFor(). However, this removes all sub-lists from the result:

s = '{aaaa{bc}xx{d{e}}f}'
not_braces = pp.CharsNotIn('{}')
expr = pp.nestedExpr('{', '}', content=not_braces)
res = pp.originalTextFor(expr)('L0 Contents').parseString(s)
print(res.dump())

prints:

['{aaaa{bc}xx{d{e}}f}']
- L0 Contents: '{aaaa{bc}xx{d{e}}f}'

In effect, the originalTextFor() wrapper flattened out everything that was inside it.

The question. Is there an alternative to originalTextFor() that keeps the structure of its child parse elements? (It would be nice to have a non-discarding analogue, which could be used for creation of named tokens for parsed sub-expressions)

Note that scanString() will only give me the level 0 sub-lists and will not look inside them. I guess I could use setParseAction(), but the internal operation of ParserElement objects is not documented, and I haven't had a chance to dig into the source code yet. Thanks!

Update 1. Somewhat related: https://stackoverflow.com/a/39885391/11932910 https://stackoverflow.com/a/17411455/11932910

I hope you don't have to dig into the pyparsing internals to get this kind of thing. Parse actions can manipulate the parsed results and then give them back to pyparsing as modified. Also, you can learn more about the breadth of classes and helpers at pyparsing-docs.readthedocs.io/en/pyparsing_2.4.6/pyparsing.html
– PaulMcG, Commented Jun 7, 2020 at 5:05
@Roy2012 The output I am interested in is in the first code block.
– paperskilltrees, Commented Jun 7, 2020 at 14:45

PaulMcG · Accepted Answer · 2020-06-08 10:41:30Z

Instead of using originalTextFor, wrap your nestedExpr expression in locatedExpr:

import pyparsing as pp
parser = pp.locatedExpr(pp.nestedExpr('{','}'))

locatedExpr will return a 3-element ParseResults:

start location
parsed value
end location

You can then attach a parse action to this parser to modify the parsed tokens in place, and add your own original_string named result, containing the original text as sliced from the input string:

def extract_original_text(st, loc, tokens):
    start, tokens[:], end = tokens[0]
    tokens['original_string'] = st[start:end]
parser.addParseAction(extract_original_text)

Now use this parser to parse and dump the results:

result = parser.parseString(s)
print(result.dump())

Prints:

['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']
- original_string: '{aaaa{bc}xx{d{e}}f}'

And access the original_string result using:

print(result.original_string)

EDIT - how to attach original_string to each nested substructure

Maintaining the original strings on the sub-structures requires more work than nestedExpr alone can do. You pretty much have to implement your own recursive parser.

To implement your own version of nestedExpr, you'll start with something like this:

LBRACE, RBRACE = map(pp.Suppress, "{}")
expr = pp.Forward()

term = pp.Word(pp.alphas)
expr_group = pp.Group(LBRACE + expr + RBRACE)
expr_content = term | expr_group

expr <<= expr_content[...]

sample = '{aaaa{bc}xx{d{e}}f}'
print(sample)
print(expr.parseString(sample).dump())

This will dump out the parsed results, without the 'original_string' names:

{aaaa{bc}xx{d{e}}f}
[['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']]
[0]:
  ['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']
  [0]:
    aaaa
  [1]:
    ['bc']
  [2]:
    xx
  [3]:
    ['d', ['e']]
    [0]:
      d
    [1]:
      ['e']
  [4]:
    f

To add the 'original_string' names, we first change the Group to the locatedExpr wrapper.

expr_group = pp.locatedExpr(LBRACE + expr + RBRACE)

This will add the start and end locations to each nested subgroup (these locations are not accessible to you when using nestedExpr).

{aaaa{bc}xx{d{e}}f}
[[0, 'aaaa', [5, 'bc', 9], 'xx', [11, 'd', [13, 'e', 16], 17], 'f', 19]]
[0]:
  [0, 'aaaa', [5, 'bc', 9], 'xx', [11, 'd', [13, 'e', 16], 17], 'f', 19]
  - locn_end: 19
  - locn_start: 0
  - value: ['aaaa', [5, 'bc', 9], 'xx', [11, 'd', [13, 'e', 16], 17], 'f']
    [0]:
      aaaa
    [1]:
      [5, 'bc', 9]
      - locn_end: 9
      - locn_start: 5
      - value: ['bc']
...
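
The locn_start/locn_end pairs in this output are ordinary half-open slice bounds into the input string. As a minimal pure-Python sketch of that idea (the nested list below is copied by hand from the output above rather than produced by pyparsing, and original_strings is a hypothetical helper name, not part of the library), slicing with each pair recovers every sub-list's original text:

```python
s = '{aaaa{bc}xx{d{e}}f}'

# Location-annotated structure, transcribed from the dump above:
# each list is [start, ...contents..., end].
located = [0, 'aaaa', [5, 'bc', 9], 'xx', [11, 'd', [13, 'e', 16], 17], 'f', 19]


def original_strings(node, text):
    """Return the source slice for this node, then for each nested node (pre-order)."""
    start, *body, end = node
    found = [text[start:end]]
    for item in body:
        if isinstance(item, list):  # nested sub-list with its own start/end
            found.extend(original_strings(item, text))
    return found


print(original_strings(located, s))
# ['{aaaa{bc}xx{d{e}}f}', '{bc}', '{d{e}}', '{e}']
```

The parse action below does the same slicing, but inside pyparsing, so the text ends up attached to each subgroup as a named result.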

Our parse action is now also more complicated.

def extract_original_text(st, loc, tokens):
    # pop/delete names and list items inserted by locatedExpr
    # (save start and end locations to local vars)
    tt = tokens[0]
    start = tt.pop("locn_start")
    end = tt.pop("locn_end")
    tt.pop("value")
    del tt[0]
    del tt[-1]

    # add 'original_string' results name
    orig_string = st[start:end]
    tt['original_string'] = orig_string

expr_group.addParseAction(extract_original_text)

With this change, you will now get this structure:

{aaaa{bc}xx{d{e}}f}
[['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']]
[0]:
  ['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']
  - original_string: '{aaaa{bc}xx{d{e}}f}'
  [0]:
    aaaa
  [1]:
    ['bc']
    - original_string: '{bc}'
  [2]:
    xx
  [3]:
    ['d', ['e']]
    - original_string: '{d{e}}'
    [0]:
      d
    [1]:
      ['e']
      - original_string: '{e}'
  [4]:
    f

Note: There is a limitation in the current version of ParseResults.dump() that shows either keys or subitems, but not both. The output above requires a fix that removes that limitation, to be released in the next pyparsing version. But even though dump() does not show these substructures, they are there in your actual structure, as you can see if you print out the repr of the results:

print(repr(result[0]))

(['aaaa', (['bc'], {'original_string': '{bc}'}), 'xx', (['d', (['e'], {'original_string': '{e}'})], {'original_string': '{d{e}}'}), 'f'], {'original_string': '{aaaa{bc}xx{d{e}}f}'})
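
If only the flat list of substrings from the question is needed, it can be cross-checked without pyparsing at all. The sketch below is a different technique from the parse-action approach above: a plain stack of open-brace positions. The function name balanced_substrings and its parameters are made up for illustration, and it assumes single-character delimiters with no escaping or quoting.

```python
def balanced_substrings(text, open_ch='{', close_ch='}'):
    """Return every balanced open_ch...close_ch substring, outermost first."""
    stack = []   # positions of currently unmatched opening delimiters
    spans = []   # (start, end) half-open slice bounds of completed groups
    for pos, ch in enumerate(text):
        if ch == open_ch:
            stack.append(pos)
        elif ch == close_ch:
            if not stack:
                raise ValueError(f"unmatched {close_ch!r} at position {pos}")
            spans.append((stack.pop(), pos + 1))
    if stack:
        raise ValueError(f"unmatched {open_ch!r} at position {stack[-1]}")
    spans.sort()  # order by start position, i.e. pre-order of the nesting
    return [text[start:end] for start, end in spans]


print(balanced_substrings('{aaaa{bc}xx{d{e}}f}'))
# ['{aaaa{bc}xx{d{e}}f}', '{bc}', '{d{e}}', '{e}']
```

This gives the substrings only, not the parsed hierarchy, so it complements the pyparsing result rather than replacing it.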

Is it possible to accumulate the original sub-strings recursively, so that I get '{aaaa{bc}xx{d{e}}f}' as well as '{bc}' (and so on) in the result?
