Recursive parsing with Python's parsec.py library

Question

I have a basic question about parsing using Python's parsec.py library.

I would like to extract a date that appears somewhere inside a text. For example:

Lorem ipsum dolor sit amet. A number 42 is present here. But here is a date 11/05/2017. Can you extract this?

or

Lorem ipsum dolor sit amet.
A number 42 is present here.

But here is a date 11/05/2017. Can you extract this?

In both cases I want the parser to return 11/05/2017.

I only want to use the parsec.py parsing library and I don't want to use regex, although parsec's built-in regex function is okay.

I tried something like

from parsec import *

ss = "Lorem ipsum dolor sit amet. A number 42 is present here. But here is a date 11/05/2017. Can you extract this?"

date_parser = regex(r'[0-9]{2}/[0-9]{2}/[0-9]{4}')

date = date_parser.parse(ss)

I get ParseError: expected [0-9]{2}/[0-9]{2}/[0-9]{4} at 0:0

Is there a way to ignore the text until the date_parser pattern is reached, without raising an error?

Excuse me, but you don't want to "parse the text until the date_parser pattern has reached". You want to ignore the text until the date_parser pattern has been reached. If you were parsing the initial text, you would end up with a syntactic analysis of that text. That is the problem parsec is designed for. It is not designed for the problem of finding a regex somewhere inside a long string, which is the problem the regex library is intended to solve. So why do you not want to use it?
– rici, Commented Jun 4, 2021 at 17:48
You are right, I want to ignore the text until the pattern has been reached. I will update the question. Secondly, what if I want to parse a specific date from a line which has a certain pattern within it? Also, with more complex logic like this, regex becomes very difficult to maintain.
– link, Commented Jun 4, 2021 at 17:56
re.findall(r'[0-9]{2}/[0-9]{2}/[0-9]{4}', ss) will find all occurrences of your date pattern in ss. This is not difficult to maintain, and is simpler than any solution involving parsec.
– Michael Dyck, Commented Jun 7, 2021 at 13:19
@MichaelDyck This logic can become more complicated. What if I want '-' or ':' instead of '/'? What if the date is written as Nov 5th 2017? A regex covering all of this becomes more complicated and unreadable after some time.
– link, Commented Jun 8, 2021 at 19:35

sighingnow · Accepted Answer · 2021-06-08 04:13:05Z

4

What you want is a parser which skips any unmatched chars, then parses the regex pattern that follows.

The date pattern could be defined with regex parser,

date_pattern = regex(r'[0-9]{2}/[0-9]{2}/[0-9]{4}')
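On its own, this pattern fails at 0:0 on the question's input because a parser is applied at the current position rather than searching ahead. A minimal sketch of that behaviour with the standard re module, assuming parsec's regex parser behaves like an anchored match at a given index (the variable names here are only illustrative):

```python
import re

ss = ("Lorem ipsum dolor sit amet. A number 42 is present here. "
      "But here is a date 11/05/2017. Can you extract this?")
date_re = re.compile(r'[0-9]{2}/[0-9]{2}/[0-9]{4}')

# Anchored at index 0, like a parser applied to the start of the input.
anchored = date_re.match(ss, 0)

# Anchored at the index where the date actually begins.
start = ss.index("11/05/2017")
at_date = date_re.match(ss, start)

print(anchored)           # None: the text at position 0 is not a date
print(at_date.group(0))   # the date, once the prefix has been skipped
```

So the remaining work is to get from index 0 to the index where the date starts, which is what the rest of this answer builds.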

We first define a parser which consumes an arbitrary char (which would be included in the library (edit: has been included in v3.9)),

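# Note: this name shadows Python's built-in any().
def any():
    '''Parse an arbitrary single character.'''
    @Parser
    def any_parser(text, index=0):
        if index < len(text):
            # Consume exactly one character and return it as the result.
            return Value.success(index + 1, text[index])
        else:
            # End of input: there is nothing left to consume.
            return Value.failure(index, 'a random char')
    return any_parser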

To express the idea of "skip any chars, then match a pattern", we need a recursive parser such as

date_parser = date_pattern ^ (any() >> date_parser)

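But this does not work in Python, because date_parser is referenced on the right-hand side before it has been defined. Wrapping the recursive reference in a @generate function defers the name lookup until parse time, thus we need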

@generate
def date_with_prefix():
    matched = yield(any() >> date_parser)
    return matched

date_parser = date_pattern ^ date_with_prefix

(Here the combinator ^ means try_choice; you can find it in the docs.)

Then it works as expected:

>>> date_parser.parse("Lorem ipsum dolor sit amet.")
---------------------------------------------------------------------------
ParseError                                Traceback (most recent call last)
...

ParseError: expected date_with_prefix at 0:27

>>> date_parser.parse("A number 42 is present here.")
---------------------------------------------------------------------------
ParseError                                Traceback (most recent call last)
...

ParseError: expected date_with_prefix at 0:28

>>> date_parser.parse("But here is a date 11/05/2017. Can you extract this?")
'11/05/2017'

To avoid the exception on invalid input and return None instead, you could define it as an optional parser:

date_parser = optional(date_pattern ^ date_with_prefix)
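For intuition, the control flow of the recursive definition can be modelled without the library. The sketch below is not parsec code: find_date is a hypothetical helper, the standard re module stands in for the regex parser, and returning None stands in for the optional wrapper.

```python
import re

DATE = re.compile(r'[0-9]{2}/[0-9]{2}/[0-9]{4}')

def find_date(text, index=0):
    """Library-free model of optional(date_pattern ^ (any() >> date_parser))."""
    # First alternative: try the date pattern at the current position.
    m = DATE.match(text, index)
    if m:
        return m.group(0)
    # Second alternative: consume one arbitrary character, then recurse.
    if index < len(text):
        return find_date(text, index + 1)
    # End of input and no alternative succeeded: report "no date".
    return None

print(find_date("But here is a date 11/05/2017. Can you extract this?"))
print(find_date("Lorem ipsum dolor sit amet."))
```

Each skipped character costs one level of recursion in this model, so the depth grows with the length of the text before the date.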

edited Jun 8, 2021 at 4:13

answered Jun 8, 2021 at 2:41


2 Comments

link Over a year ago

Thank you so much for helping me. Yes, I was looking for a recursive solution like the one you outlined. Although it works for smaller cases, for larger texts I hit a RecursionError. This is probably a limitation of Python. Not sure if you have more insights into this, but for now this works. Thanks!

sighingnow Over a year ago

Thanks for letting me know about the trouble with large texts. I think it is solvable, and I will comment here when I have made some progress on that.
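One way the RecursionError might be avoided is to replace the recursion with a loop, since a recursive definition uses at least one Python stack frame per skipped character. The following is only a library-free sketch of that idea, not the fix the answer's author had in mind: skip_until_date is a hypothetical name and the standard re module stands in for parsec's regex parser.

```python
import re

DATE = re.compile(r'[0-9]{2}/[0-9]{2}/[0-9]{4}')

def skip_until_date(text, index=0):
    """Scan forward one character at a time with a loop instead of recursion.

    Returns (next_index, matched_text) on success, or None if no date is found.
    """
    while index < len(text):
        m = DATE.match(text, index)   # anchored match at the current position
        if m:
            return m.end(), m.group(0)
        index += 1                    # skip one arbitrary character
    return None

# A prefix far longer than Python's default recursion limit.
long_text = "x" * 100_000 + " 11/05/2017 trailing text"
print(skip_until_date(long_text))
```

Inside parsec, the same loop could presumably live in a function decorated with @Parser that returns Value.success(next_index, matched_text) or Value.failure(index, ...), in the same way as the any() parser in the answer. That wiring is an assumption and is not shown here.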
