0

I have thousands of datasets from which I want to extract the year that follows a month name. For example:

In dataset 1: September 1980

In dataset 2: October, 1978

The regular expression that I wrote using https://regex101.com/:

^(?<month>)\w+(\1)\s[0-9]{4}$|(^(?<fmonth>)\w+,\s[0-9]{4}$)

It works on that site. However, when I used it in my Python code, I got the error below:

  File "<ipython-input-216-a995358d0957>", line 1, in <module>
    runfile('C:/Users/Muntabir/nltk_data/corpora/cookbook/clean_data/text-classification_year(clean).py', wdir='C:/Users/Muntabir/nltk_data/corpora/cookbook/clean_data')
  File "C:\Users\Muntabir\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)
  File "C:\Users\Muntabir\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/Muntabir/nltk_data/corpora/cookbook/clean_data/text-classification_year(clean).py", line 76, in <module>
    year_data = re.findall('^(?<month>)\w+(\1)\s[0-9]{4}$|(^(?<fmonth>)\w+,\s[0-9]{4}$)', tokenized_string)
  File "C:\Users\Muntabir\Anaconda3\lib\re.py", line 222, in findall
    return _compile(pattern, flags).findall(string)
  File "C:\Users\Muntabir\Anaconda3\lib\re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\Muntabir\Anaconda3\lib\sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\Muntabir\Anaconda3\lib\sre_parse.py", line 855, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\Muntabir\Anaconda3\lib\sre_parse.py", line 416, in _parse_sub
    not nested and not items))
  File "C:\Users\Muntabir\Anaconda3\lib\sre_parse.py", line 691, in _parse
    len(char) + 2)
error: unknown extension ?<m

I am not sure what is causing this error. Can anyone explain it and suggest a possible solution? Your help would be much appreciated.

Thanks

3
  • 2
    This is not a valid Python regexp. You probably tested it on regex101.com with PHP selected (under "Flavor"). Commented Feb 25, 2020 at 11:04
  • r'\w+,?\s+[0-9]{4}(?!\d)' Commented Feb 25, 2020 at 11:08
  • Hello Wiktor, it works, but it extracts multiple years from the document, which I do not want. I want to extract only the year that follows a month, on a line that starts with the month. This is why I put the "^" (caret) symbol at the start of the regex. Commented Feb 26, 2020 at 15:52
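The requirement in the last comment (only lines that begin with a month and end with a year) can be sketched by anchoring the suggested pattern and compiling it with `re.MULTILINE`. This is only an illustration: the `sample` text and the name `line_year` are assumptions, and `\w+` accepts any word, not just month names.

```python
import re

# Assumed sample text; only two lines start with a word followed by a year.
sample = (
    "September 1980\n"
    "Revised in 1995 and again in 2001\n"
    "October, 1978\n"
)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
# The year is the only capture group, so findall() returns just the years.
line_year = re.compile(r"^\w+,?\s+([0-9]{4})$", re.MULTILINE)

years = line_year.findall(sample)
print(years)  # ['1980', '1978']
```

The years in the middle line are skipped because that line does not consist of a single word followed by a four-digit number.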

3 Answers

1

I really appreciate all of your contributions. @Joan Lara Ganau's solution gave me a guideline for what the regexp could be. @Joan, your regexp also matches when the year is preceded by a day and a month, and it does not specifically look for a comma and a space. As I mentioned, I have thousands of datasets from which I want to extract exactly the year that follows a month. I was looking for the following formats:

a.) Month Year
b.) Month, Year

Anyway, I found a solution to my problem after a number of experiments. The solution is:

import re

# Every piece is a raw string so that \s and \d reach the regex engine unchanged.
year_result = re.compile(
    r"(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|"
    r"Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|"
    r"Dec(ember)?)(,?)(\s\d{4})")

Note that match() and search() return None when the pattern does not match. Calling group() on that result raises an AttributeError ('NoneType' object has no attribute 'group'). So I handled it in the following manner:

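def matched(document):
    # search() scans the whole document; match() would only test its start
    year = year_result.search(document)
    if year is None:
        return '0'
    # group 14 is the year together with its leading whitespace, so strip it
    return year.group(14).strip()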

Now you can pass the text document from which you want to extract the year to the function above.
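As a possible variation on the same idea, the month alternatives can be made non-capturing and the year given a name, so the result no longer depends on counting groups up to 14. This is a sketch, not the code used above: `MONTH_YEAR`, `extract_year`, and the `default` parameter are names introduced here for illustration.

```python
import re

# Non-capturing month alternatives plus a named group for the year,
# so the year is retrieved by name instead of by position.
MONTH_YEAR = re.compile(
    r"\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|"
    r"July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|"
    r"Dec(?:ember)?),?\s(?P<year>\d{4})\b"
)

def extract_year(document, default="0"):
    """Return the first year that directly follows a month name, or default."""
    found = MONTH_YEAR.search(document)
    return found.group("year") if found else default

print(extract_year("Published September 1980"))  # 1980
print(extract_year("October, 1978"))             # 1978
print(extract_year("no date here"))              # 0
```

Because the whitespace sits outside the named group, the returned year has no leading space.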

Thanks

1 Comment

Glad I helped you :)
0
import re

year = re.compile(r'(\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?\D?(\d{1,4})')
print(year.match('September 1980').group(3))
print(year.match('October, 1978').group(3))

Output:

1980
1978

0

In Python's re module, a named capture group is written (?P<name>...), not (?<name>...).

Use: ^(?P<month>\w+),?\s[0-9]{4}$

Demo & explanation
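A minimal sketch of the difference, assuming the two sample strings from the question. The second named group, `year`, is an addition made here so that the year can be read back directly; the variable names are illustrative.

```python
import re

# The (?<name>...) spelling from the question is rejected by Python's re module.
try:
    re.compile(r"^(?<month>\w+),?\s[0-9]{4}$")
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print("re.error:", exc)

# The (?P<name>...) spelling compiles; the year gets its own named group.
pattern = re.compile(r"^(?P<month>\w+),?\s(?P<year>[0-9]{4})$")

for text in ("September 1980", "October, 1978"):
    m = pattern.match(text)
    print(m.group("month"), m.group("year"))
```

The loop prints the month and year for each string, for example `October 1978` for the second one.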
