I have strings that have dates in different formats. For example,

sample_str_1 = 'this amendment of lease, made and entered as of the  10th day of august, 2016,   by and between john doe and jane smith'

Another string contains the date in a different form:

sample_str_2 = 'this agreement, made and entered as of May 1, 2016, between john doe and jane smith'

To extract just the date from the first string, I tried this:

match = re.findall(r'\S+d{4}\s+', sample_str_1)

This gives an empty list.

I used the same approach on the second string and also got an empty list.

I also tried the datefinder module, which gave me this output:

import datefinder
match = datefinder.find_dates(sample_str_1)

for m in match:
    print(m)

>> 2016-08-01 00:00:00

This output is wrong; it should be 2016-08-10 00:00:00.

I tried another approach based on an older post:

match = re.findall(r'\d{2}(?:january|february|march|april|may|june|july|august|september|october|november|december)\d{4}',sample_str_1)

This again gave me an empty list.

How can I extract dates like that from a string? Is there a generic method to extract dates that have text and digits? Any help would be appreciated.

  • Maybe you should look at the dateparser package. Reinventing the wheel here doesn't make much sense... Commented Mar 1, 2018 at 21:21
  • @ctwheels That didn't work. I used date_parse = DateDataParser().get_date_data(sample_str_1) and got {'date_obj': None, 'locale': None, 'period': 'day'} Commented Mar 1, 2018 at 21:43
  • Do you only need to match the specific phrases [day]st/nd/rd/th day of [month], [year] and [month] [day], [year]? There are many other ways to format a date. Commented Mar 1, 2018 at 22:17
  • You have only two formats of date 10th day of august, 2016 and May 1, 2016? Commented Mar 1, 2018 at 22:21
  • @CAustin yes, that is one format and string 2 has a different format. Commented Mar 1, 2018 at 22:24

1 Answer

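Regex: (?:(\d{1,2})(?:st|nd|rd|th).* ([a-z]{3})[a-z]*|([a-z]{3})[a-z]* (\d{1,2})),\s*(\d{4})

Python code:

import datetime
import re

# Assumed input: the two sample strings from the question, one per line
text = sample_str_1 + '\n' + sample_str_2

regex = re.compile(r'(?:(\d{1,2})(?:st|nd|rd|th).* ([a-z]{3})[a-z]*|([a-z]{3})[a-z]* (\d{1,2})),\s*(\d{4})', re.I)

for x in regex.findall(text):
    if x[0] == '':
        # "May 1, 2016" form: only groups 3-5 (month, day, year) are non-empty
        date = '-'.join(filter(None, x))
    else:
        # "10th day of august, 2016" form: groups 1, 2, 5 are day, month, year
        date = '%s-%s-%s' % (x[1], x[0], x[4])

    print(datetime.datetime.strptime(date, '%b-%d-%Y').date())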

Output:

2016-08-10
2016-05-01
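
If you need the dates as values rather than printed lines, the same idea can be wrapped in a function. This is only a sketch: extract_dates and DATE_RE are names I made up, the pattern accepts all four ordinal suffixes (st, nd, rd, th), and it assumes an English locale for %b and one date per line, since the greedy .* would otherwise run across two dates on the same line.

```python
import datetime
import re

# The answer's pattern, accepting st/nd/rd/th as ordinal suffixes.
DATE_RE = re.compile(
    r'(?:(\d{1,2})(?:st|nd|rd|th).* ([a-z]{3})[a-z]*'
    r'|([a-z]{3})[a-z]* (\d{1,2})),\s*(\d{4})',
    re.I,
)


def extract_dates(text):
    """Return the dates found in text as datetime.date objects (hypothetical helper)."""
    dates = []
    for day1, month1, month2, day2, year in DATE_RE.findall(text):
        # Only one alternative matches per hit; use whichever groups are filled.
        day, month = (day1, month1) if day1 else (day2, month2)
        try:
            parsed = datetime.datetime.strptime(
                '%s-%s-%s' % (month, day, year), '%b-%d-%Y')
        except ValueError:
            # Three letters that are not a month abbreviation, e.g. "the 3, 2016"
            continue
        dates.append(parsed.date())
    return dates


sample_str_1 = 'this amendment of lease, made and entered as of the  10th day of august, 2016,   by and between john doe and jane smith'
sample_str_2 = 'this agreement, made and entered as of May 1, 2016, between john doe and jane smith'

print(extract_dates(sample_str_1))  # [datetime.date(2016, 8, 10)]
print(extract_dates(sample_str_2))  # [datetime.date(2016, 5, 1)]
```

The try/except is there because the pattern only checks for "three letters", not for a real month name, so strptime is what actually validates the match.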

1 Comment

This works great. What can I do if I have 2nd, 3rd, etc.? I tried changing it to (?:(\d{1,2})th|nd|rd.* (.. but it prints blank. How can I add that? (As I am a new user, I cannot upvote yet, though you deserve one.)
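
On the blank output: a likely cause is that | has the lowest precedence in a regex, so without its own group the alternation splits the entire expression instead of just the suffix. A small sketch of the difference, using shortened patterns and made-up phrases rather than the full regex from the answer:

```python
import re

# Ungrouped: "|" splits the whole pattern, so this reads as
#   "(\d{1,2})th"  OR  "nd"  OR  "rd day"
ungrouped = re.compile(r'(\d{1,2})th|nd|rd day', re.I)

# Grouped: the suffix alternatives are confined to a non-capturing group.
grouped = re.compile(r'(\d{1,2})(?:st|nd|rd|th) day', re.I)

for phrase in ('the 2nd day of march',
               'the 3rd day of march',
               'the 10th day of march'):
    print(phrase, ungrouped.findall(phrase), grouped.findall(phrase))
```

With the ungrouped pattern, "2nd" and "3rd" match through the bare "nd" and "rd" branches, where the day group is empty, which is why the result looks blank. Wrapping the suffixes in (?:...) keeps the digits in the capture group.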
