I am trying to parse multiline text with the Python parsimonious library. I've been experimenting with it for a while and can't figure out how to handle newlines effectively. One example is below; the behavior it shows makes sense to me, but I don't know how to get past it. I saw this comment from Erik Rose in the parsimonious issues, but I could not figure out how to implement it without errors. Thanks for any tips.

singleline_text = '''\
FIRST   something cool'''

multiline_text = '''\
FIRST   something very
        cool
SECOND  more awesomeness        
'''

from parsimonious.grammar import Grammar

grammar = Grammar(
    """
    bin           = ORDER spaces description
    ORDER         = 'FIRST' / 'SECOND'
    spaces        = ~'\s*'
    description   = ~'[A-z0-9 ]*'
    """)

This works fine for the single-line input; print(grammar.parse(singleline_text)) gives:

<Node called "bin" matching "FIRST   something cool">
    <Node called "ORDER" matching "FIRST">
        <Node matching "FIRST">
    <RegexNode called "spaces" matching "   ">
    <RegexNode called "description" matching "something cool">

But the multiline input fails, and I was unable to resolve it based on the link above; print(grammar.parse(multiline_text)) gives:

---------------------------------------------------------------------------
IncompleteParseError                      Traceback (most recent call last)
<ipython-input-123-c346891dc883> in <module>()
----> 1 print(grammar.parse(multiline_text))

/Users/me/anaconda3/lib/python3.6/site-packages/parsimonious/grammar.py in parse(self, text, pos)
    121         """
    122         self._check_default_rule()
--> 123         return self.default_rule.parse(text, pos=pos)
    124 
    125     def match(self, text, pos=0):

/Users/me/anaconda3/lib/python3.6/site-packages/parsimonious/expressions.py in parse(self, text, pos)
    110         node = self.match(text, pos=pos)
    111         if node.end < len(text):
--> 112             raise IncompleteParseError(text, node.end, self)
    113         return node
    114 

IncompleteParseError: Rule 'bin' matched in its entirety, but it didn't consume all the text. The non-matching portion of the text begins with '
        cool
SECOND' (line 1, column 23).

Here is one thing I tried that did not work:

grammar2 = Grammar(
    """
    bin           = ORDER spaces description newline
    ORDER         = 'FIRST' / 'SECOND'
    spaces        = ~'\s*'
    description   = ~'[A-z0-9 \n]*'
    newline       = ~r'#[^\r\n]*'
    """)

print(grammar2.parse(multiline_text))

The output, truncated from a 211-line stack trace:

ERROR:root:An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 4))

---------------------------------------------------------------------------
SyntaxError                               Traceback (most recent call last)

...


VisitationError: SyntaxError: EOL while scanning string literal (<unknown>, line 1)

Parse tree:
<Node called "spaceless_literal" matching "'[A-z0-9 
]*'">  <-- *** We were here. ***
    <RegexNode matching "'[A-z0-9 
    ]*'">

1 Answer

Your top-level rule matches only a single bin, so the parse stops at the end of the first line. It looks like you need to repeat the bin element in your grammar:

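from parsimonious.grammar import Grammar

# "one" is the top-level rule: one or more bin entries,
# each followed by zero or more newlines.
grammar = Grammar(
    r"""
    one           = bin+
    bin           = ORDER spaces description newline
    ORDER         = 'FIRST' / 'SECOND'
    newline       = ~"\n*"
    spaces        = ~"\s*"
    description   = ~"[A-Za-z0-9 ]*"
    """)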

With that, you can parse input like:

multiline_text = '''\
FIRST   something very cool
SECOND  more awesomeness      
SECOND  even better
'''
1 Comment

This works well for the case of one entry per line. Thank you. I am trying to use your tips to address the wrapping issue, where a description may span multiple lines.
