I have a basic question about parsing using Python's parsec.py library.

I would like to extract a date that appears somewhere inside a text. For example,

Lorem ipsum dolor sit amet. A number 42 is present here. But here is a date 11/05/2017. Can you extract this?

or

Lorem ipsum dolor sit amet.
A number 42 is present here.

But here is a date 11/05/2017. Can you extract this?

In both cases I want the parser to return 11/05/2017.

I only want to use the parsec.py parsing library and I don't want to use plain regex. parsec's built-in regex function is okay.

I tried something like this:

from parsec import *

ss = "Lorem ipsum dolor sit amet. A number 42 is present here. But here is a date 11/05/2017. Can you extract this?"

date_parser = regex(r'[0-9]{2}/[0-9]{2}/[0-9]{4}')

date = date_parser.parse(ss)

I get ParseError: expected [0-9]{2}/[0-9]{2}/[0-9]{4} at 0:0

Is there a way to ignore the text until the date_parser pattern has been reached, without raising an error?

  • Excuse me, but you don't want to "parse the text until the date_parser pattern has reached". You want to ignore the text until the date_parser pattern has been reached. If you were parsing the initial text, you would end up with a syntactic analysis of that text. That is the problem parsec is designed for. It is not designed for the problem of finding a regex somewhere inside a long string, which is the problem the regex library is intended to solve. So why do you not want to use it? Commented Jun 4, 2021 at 17:48
  • You are right, I want to ignore the text until the pattern has been reached. I will update the question. Secondly, what if I want to parse a specific date from a line which has a certain pattern within it? Also, with more complex logic like this, regex becomes very difficult to maintain. Commented Jun 4, 2021 at 17:56
  • re.findall(r'[0-9]{2}/[0-9]{2}/[0-9]{4}', ss) will find all occurrences of your date pattern in ss. This is not difficult to maintain, and is simpler than any solution involving parsec. Commented Jun 7, 2021 at 13:19
  • @MichaelDyck This logic can become more complicated. What if I want '-' or ':' instead of '/'? What if the date is written as Nov 5th 2017? Writing a regex for this logic just becomes more complicated and unreadable after some time. Commented Jun 8, 2021 at 19:35

1 Answer

What you want is a parser that skips any unmatched characters and then parses the regex pattern that follows.

The date pattern can be defined with the regex parser:

from parsec import Parser, Value, generate, optional, regex

date_pattern = regex(r'[0-9]{2}/[0-9]{2}/[0-9]{4}')

We first define a parser that consumes an arbitrary character (this would be included in the library (edit: it has been included in v3.9)):

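def any():
    '''Parse an arbitrary single character.'''
    @Parser
    def any_parser(text, index=0):
        # Succeed on whatever character is at the current position.
        if index < len(text):
            return Value.success(index + 1, text[index])
        else:
            # Only the end of the input makes this parser fail.
            return Value.failure(index, 'an arbitrary char')
    return any_parser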

To express the idea of "skip any characters, then match a pattern", we need a recursive parser such as

date_parser = date_pattern ^ (any() >> date_parser)

But this does not work in Python as written, because date_parser is referenced before it is defined, so we need

@generate
def date_with_prefix():
    # date_parser is looked up when this runs, after it has been defined below.
    matched = yield any() >> date_parser
    return matched

date_parser = date_pattern ^ date_with_prefix

(Here the combinator ^ means try_choice; you can find it in the docs.)

Then it would work as expected:

>>> date_parser.parse("Lorem ipsum dolor sit amet.")
---------------------------------------------------------------------------
ParseError                                Traceback (most recent call last)
...

ParseError: expected date_with_prefix at 0:27

>>> date_parser.parse("A number 42 is present here.")
---------------------------------------------------------------------------
ParseError                                Traceback (most recent call last)
...

ParseError: expected date_with_prefix at 0:28

>>> date_parser.parse("But here is a date 11/05/2017. Can you extract this?")
'11/05/2017'

To avoid the exception on invalid input and return None instead, you can define it as an optional parser:

date_parser = optional(date_pattern ^ date_with_prefix)

2 Comments

Thank you so much for helping me. Yes, I was looking for a recursive solution like the one you outlined. Although it works for smaller cases, for larger texts I hit a RecursionError. This is probably a limitation of Python. Not sure if you have more insights into this. But for now, this works. Thanks!
Thanks for letting me know about the trouble with large texts. I think it is solvable and I will comment here when I have made some progress on it.
