Extracting a year preceded by a month using regex in Python

Question

I have thousands of datasets from which I want to extract the year that is preceded by a month. For example:

In dataset 1: September 1980

In dataset 2: October, 1978

The regular expression that I wrote using https://regex101.com/:

^(?<month>)\w+(\1)\s[0-9]{4}$|(^(?<fmonth>)\w+,\s[0-9]{4}$)

It works on that site. However, when I tried to use it in my Python code, I got the error below:

  File "<ipython-input-216-a995358d0957>", line 1, in <module>
    runfile('C:/Users/Muntabir/nltk_data/corpora/cookbook/clean_data/text-classification_year(clean).py', wdir='C:/Users/Muntabir/nltk_data/corpora/cookbook/clean_data')
  File "C:\Users\Muntabir\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)
  File "C:\Users\Muntabir\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/Muntabir/nltk_data/corpora/cookbook/clean_data/text-classification_year(clean).py", line 76, in <module>
    year_data = re.findall('^(?<month>)\w+(\1)\s[0-9]{4}$|(^(?<fmonth>)\w+,\s[0-9]{4}$)', tokenized_string)
  File "C:\Users\Muntabir\Anaconda3\lib\re.py", line 222, in findall
    return _compile(pattern, flags).findall(string)
  File "C:\Users\Muntabir\Anaconda3\lib\re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\Muntabir\Anaconda3\lib\sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\Muntabir\Anaconda3\lib\sre_parse.py", line 855, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\Muntabir\Anaconda3\lib\sre_parse.py", line 416, in _parse_sub
    not nested and not items))
  File "C:\Users\Muntabir\Anaconda3\lib\sre_parse.py", line 691, in _parse
    len(char) + 2)
error: unknown extension ?<m

I am not sure why it causes this error. Can anyone provide an explanation and a possible solution? Your help would be much appreciated.

Thanks

This is not a valid Python regexp. You probably tested it on regex101.com with PHP selected (under "Flavor").
– Błotosmętek, Commented Feb 25, 2020 at 11:04
Hello Wiktor, it works, but it extracts multiple years from the document, which I do not want. I want to extract only the year that is preceded by a month, on a line that starts with the month. This is why I put the "^" (caret) symbol at the start of the regex.
– Muntabir Choudhury, Commented Feb 26, 2020 at 15:52

Muntabir Choudhury · Accepted Answer · 2020-03-02 21:31:42Z

1

I really appreciate all of your contributions. @Joan Lara Ganau's solution gave me a guideline for what the regexp could be. @Joan, your regexp will also match a year that is preceded by a month and a day. Also, it does not specifically look for a comma and a space. As I mentioned, I have thousands of datasets from which I want to extract exactly the year that is preceded by a month. I was looking for the following formats:

a.) Month Year
b.) Month, Year

Anyway, I found the solution to my problem after a number of experiments. The solution is:

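import re

# Every piece of the pattern is a raw string so that \s and \d reach the
# regex engine unchanged.
year_result = re.compile(
    r"(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|"
    r"Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|"
    r"Dec(ember)?)(,?)(\s\d{4})")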

Note that match() and search() return None when the pattern does not match. Calling group() on that result raises an AttributeError, something like "'NoneType' object has no attribute 'group'". So I handled it in the following manner:

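def matched(document):
    # search() scans the whole document; match() would only test its start.
    year = year_result.search(document)
    if year is None:
        return '0'  # sentinel for "no month-year pair found"
    # Group 14 is (\s\d{4}), so the result keeps the leading whitespace.
    return year.group(14)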

Now you can pass the text document from which you want to extract the year to the function above.
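
Counting parentheses to arrive at group(14) is easy to get wrong when the pattern changes. As a minimal sketch of one alternative, the version below makes the month alternatives non-capturing and gives the four digits a named group. The names MONTH_YEAR, extract_year and the group name "year" are illustrative assumptions, not part of the code above, and the function returns None instead of '0' when nothing is found.

```python
import re

# Hypothetical variant of the pattern above: (?:...) groups do not capture,
# so the only group left is the named one, "year".
MONTH_YEAR = re.compile(
    r"\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|"
    r"Dec(?:ember)?),?\s(?P<year>\d{4})"
)


def extract_year(document):
    """Return the first year that follows a month name, or None if absent."""
    found = MONTH_YEAR.search(document)
    return found.group("year") if found else None


print(extract_year("Published in September 1980"))  # 1980
print(extract_year("October, 1978"))                # 1978
print(extract_year("no date here"))                 # None
```

Because the whitespace sits outside the named group, the returned year has no leading space.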

Thanks


Glad I helped you :)
– Joan Lara, Commented over a year ago

Joan Lara · Accepted Answer · 2020-02-25 11:15:11Z

0

import re

year = re.compile(r'(\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?\D?(\d{1,4})')
print(year.match('September 1980').group(3))
print(year.match('October, 1978').group(3))

Output:

1980
1978
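
A short sketch of how this pattern behaves on a few other inputs may help when deciding whether it fits. The pattern is copied unchanged from the code above; the sample strings are made up for illustration.

```python
import re

# Pattern copied from the answer above.
year = re.compile(r'(\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?\D?(\d{1,4})')

# A leading day of the month is accepted and captured in group 1.
with_day = year.match('25 February 2020')
print(repr(with_day.group(1)), with_day.group(3))

# \d{1,4} also accepts "years" shorter than four digits.
short = year.match('May 5')
print(short.group(3))

# match() only tests the start of the string; search() scans all of it.
print(year.match('Published in September 1980'))
print(year.search('Published in September 1980').group(3))
```

If only four-digit years are wanted, replacing \d{1,4} with \d{4} is one possible tightening.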


Toto · Accepted Answer · 2020-02-25 11:35:21Z

0

In Python, a named capture group is written (?P<name>...), not (?<name>...).

Use: ^(?P<month>\w+),?\s[0-9]{4}$
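
The difference between the two spellings can be checked directly. The sketch below first shows that Python's re module rejects the (?<name>...) form, then applies the corrected pattern line by line with re.MULTILINE. The "year" group, the name LINE and the sample text are additions for illustration only; the pattern above names just the month.

```python
import re

# The (?<name>...) spelling from the question is rejected at compile time.
try:
    re.compile(r"^(?<month>\w+),?\s[0-9]{4}$")
except re.error as exc:
    print("re.error:", exc)

# Python spelling. With re.MULTILINE, ^ and $ anchor at every line, so only
# lines made up entirely of "Word Year" or "Word, Year" match.
LINE = re.compile(r"^(?P<month>\w+),?\s(?P<year>[0-9]{4})$", re.MULTILINE)

text = "September 1980\nsome other text 1999\nOctober, 1978\n"
years = [m.group("year") for m in LINE.finditer(text)]
print(years)  # ['1980', '1978']
```

Note that \w+ accepts any word, not only month names, so a line such as "Chapter 1234" would also match. Listing the month names explicitly avoids that.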


