
I am building regular expressions to find dates in my text. I have created lists for the month names, the day names, and the special characters that can appear in a date.

dict_month_name =['january','february','march','april','may','june','july','august','september','october','november','december']

dict_day =['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']

dict_special_char = ['-', '/', '.', ',' ,'',' ']

I have also compiled them as shown below.

month_name = re.compile('|'.join(dict_month_name))

day = re.compile('|'.join(dict_day))

special_char = re.compile('|'.join(dict_special_char))

Now, in the regular expression shown below, I want to use different combinations of the lists I created earlier. For example, to search for dates like "Monday, January 2017", the regex would be:

regexp1 = re.findall('.*?^(day+,\s,month_name+\s[0-9][0-9][0-9][0-9])$.*', text)

However, the regex does not return any matches. I need to solve this with regex and not the datetime module. Is there a way to include my lists inside the regular expression as shown above?

  • regexp1 is not using any of the precompiled regexes, and is literally searching for 'day' and 'month_name' in text. Commented Mar 1, 2018 at 13:32
  • I don't think there's a way to directly combine compiled regexes. Closest I could find is this. Commented Mar 1, 2018 at 13:32
  • @DeepSpace Is there a way I can tell the re.findall function to read "day" and "month_name" as a list and not text to search for a pattern as you mentioned? Commented Mar 1, 2018 at 13:41
  • thanks for the advice. I have reviewed my questions and accepted answers as appropriate. Commented Mar 6, 2018 at 18:57
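To illustrate the first comment, here is a minimal sketch. A variable name written inside a pattern string is matched as plain letters, so `day` in the question's pattern only ever matches the characters d, a, y. One possible workaround, assuming the compiled objects from the question, is to read each object's `.pattern` attribute (the source string it was compiled from) and build a new pattern from those strings. The name `combined` is introduced here for illustration only.

```python
import re

dict_month_name = ['january', 'february', 'march', 'april', 'may', 'june',
                   'july', 'august', 'september', 'october', 'november', 'december']
dict_day = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

month_name = re.compile('|'.join(dict_month_name))
day = re.compile('|'.join(dict_day))

# Inside a pattern string, "day" is just three literal letters.
print(re.findall(r'day', 'Monday, January 2017'))  # ['day']

# .pattern gives back the source string of a compiled regex, so it can be
# interpolated into a larger pattern. {{4}} becomes {4} after .format().
combined = re.compile(
    r'(?:{}),\s+(?:{})\s+[0-9]{{4}}'.format(day.pattern, month_name.pattern),
    re.I,
)
print(combined.findall('Monday, January 2017'))  # ['Monday, January 2017']
```

Each sub-pattern is wrapped in a non-capturing group `(?:...)` so that the `|` alternatives stay confined to their own part of the larger pattern.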

1 Answer

You can build a single pattern string from the lists as follows:

import re

dict_month_name = ['january', 'february', 'march', 'april', 'may', 'june', 'july', 'august', 'september', 'october', 'november', 'december']
dict_day = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dict_special_char = ['-', '/', '.', ',', '', ' ']

s = 'For e.g. to search for dates like - Monday, January 2017 the regex would be'

# Interpolate the lists into one pattern string.
# {{4}} is an escaped brace pair that .format() turns into the quantifier {4}.
rx = r"\b(?:{day})[{special}]\s+(?:{month_name})\s+[0-9]{{4}}\b".format(
    day="|".join(dict_day),
    special="".join([re.escape(x) for x in dict_special_char]),
    month_name="|".join(dict_month_name))

print(re.findall(rx, s, re.I)) # => ['Monday, January 2017']

See the Python demo.

In this example, the regex will be

\b(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)[\-\/\.\,\ ]\s+(?:january|february|march|april|may|june|july|august|september|october|november|december)\s+[0-9]{4}\b

The individual lists are now alternatives inside one larger pattern. The re.I flag enables case-insensitive matching, which is why the lowercase 'january' in the list matches 'January' in the text.

Also note that the special characters should be escaped with [re.escape(x) for x in dict_special_char] so that they are matched as literal characters.
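As a small sketch of why the escaping matters, the snippet below compares the question's unescaped `special_char` pattern with an escaped version. The names `unescaped`, `escaped`, and `char_class` are introduced here for illustration. On Python 3.7 and later, `re.escape` leaves `/` and `,` unescaped, so the character class prints as `[\-/\.,\ ]` rather than the form shown above; both forms match the same characters.

```python
import re

dict_special_char = ['-', '/', '.', ',', '', ' ']

# Joined with "|" and no escaping, "." is a wildcard that matches any character.
unescaped = re.compile('|'.join(dict_special_char))
print(bool(unescaped.fullmatch('x')))  # True

# After re.escape, "." only matches a literal dot. The empty string is skipped
# here because an empty alternative would match at every position.
escaped = re.compile('|'.join(re.escape(c) for c in dict_special_char if c))
print(bool(escaped.fullmatch('x')))  # False
print(bool(escaped.fullmatch('.')))  # True

# Inside a character class the escaped items are simply concatenated;
# the empty string contributes nothing.
char_class = '[{}]'.format(''.join(re.escape(c) for c in dict_special_char))
print(re.findall(char_class, 'a-b/c.d,e f'))  # ['-', '/', '.', ',', ' ']
```

Escaping the hyphen also keeps it from being read as a range operator when it ends up between two other characters in the class.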


2 Comments

Thanks. That seemed to work. One more thing though. I have many variations within my date data. Do I need to write a regex for each unique format or is there a more efficient way to solve this using a regex dictionary method?
@user8929822 I think you need to handle them with their respective patterns, but you may use | to add alternatives to one single regex.
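The suggestion in the last comment could look roughly like the sketch below: one sub-pattern per date format, joined with `|` into a single regex. The three formats, the `formats` list, and the sample `text` are assumptions chosen for illustration, not formats taken from the asker's data.

```python
import re

dict_month_name = ['january', 'february', 'march', 'april', 'may', 'june',
                   'july', 'august', 'september', 'october', 'november', 'december']
dict_day = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

months = "|".join(dict_month_name)
days = "|".join(dict_day)

# One template per (hypothetical) date format; {{...}} survives .format() as {...}.
formats = [
    r"(?:{days}),\s+(?:{months})\s+[0-9]{{4}}",        # Monday, January 2017
    r"(?:{months})\s+[0-9]{{1,2}},\s+[0-9]{{4}}",      # March 3, 2018
    r"[0-9]{{1,2}}[-/.][0-9]{{1,2}}[-/.][0-9]{{4}}",   # 05-01-2017
]

# Fill in each template, then join the results as alternatives of one regex.
alternatives = "|".join(f.format(days=days, months=months) for f in formats)
rx = re.compile(r"\b(?:" + alternatives + r")\b", re.I)

text = "Due Monday, January 2017; shipped March 3, 2018; invoiced 05-01-2017."
print(rx.findall(text))  # ['Monday, January 2017', 'March 3, 2018', '05-01-2017']
```

Alternatives are tried left to right, so when two formats can match at the same position it is usually safer to list the more specific one first.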
