I'm writing some code to parse dates out of a very large data set. I have the following regex to match different variations of dates:
"(((0?[1-9]|1[012])(/|-)(0?[1-9]|[12][0-9]|3[01])(/|-))|"
+"((january|february|march|april|may|june|july|august|september|october|november|december)"
+ "\\s*(0?[1-9]|[12][0-9]|3[01])(th|rd|nd|st)?,*\\s*))((19|20)\\d\\d)"
which matches dates of the formats 'Month dd, yyyy', 'mm/dd/yyyy', and 'mm-dd-yyyy'. This works fine for those formats, but I'm now encountering dates in the European 'dd Month, yyyy' format. I tried adding (\\d{1,2})? at the beginning of the regex and adding a ? quantifier after the current day-matching section of the regex, like so:
"((\\d{1,2})?((0?[1-9]|1[012])(/|-)(0?[1-9]|[12][0-9]|3[01])(/|-))|"
+"((january|february|march|april|may|june|july|august|september|october|november|december)"
+ "\\s*(0?[1-9]|[12][0-9]|3[01])?(th|rd|nd|st)?,*\\s*))((19|20)\\d\\d)"
but this is not entirely viable, as it sometimes captures numeric characters both before and after the month (e.g. '00 January 15, 2013') and sometimes neither ('January 2013'). Is there a way to ensure that exactly one of the two is captured?
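One way to get "exactly one of the two" is to stop making each day optional and instead put the two orderings in an alternation: either day-then-month or month-then-day. The following is only a rough sketch of that idea, not a drop-in replacement. The class name `DateRegexSketch`, the helper `firstMatch`, the word boundaries, and the `CASE_INSENSITIVE` flag are all assumptions added for illustration, and the numeric mm/dd/yyyy branch is left out because it would not need to change.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
public class DateRegexSketch {
    static final String MONTH =
        "(?:january|february|march|april|may|june|july|august"
        + "|september|october|november|december)";
    // Day of month with an optional ordinal suffix; the day itself is mandatory.
    static final String DAY = "(?:0?[1-9]|[12][0-9]|3[01])(?:th|rd|nd|st)?";
    static final String YEAR = "(?:19|20)\\d\\d";

    // Either "day month" or "month day", never both and never neither.
    static final Pattern TEXT_DATE = Pattern.compile(
        "\\b(?:" + DAY + "\\s+" + MONTH + "|" + MONTH + "\\s+" + DAY + ")"
        + "\\s*,?\\s*" + YEAR + "\\b",
        Pattern.CASE_INSENSITIVE); // assumption: input casing varies

    // Returns the first date-like substring, or null if none is found.
    static String firstMatch(String text) {
        Matcher m = TEXT_DATE.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("signed on 15 January, 2013 in Paris"));
        System.out.println(firstMatch("signed on January 15, 2013 in Paris"));
        System.out.println(firstMatch("sometime in January 2013"));
    }
}
```

With this shape, 'January 2013' no longer matches because neither branch can be satisfied without a day, and a stray number in front of 'January 15, 2013' is simply left out of the match.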
Use SimpleDateFormat: try a list of SimpleDateFormats one after another. Getting this right in one ugly, monstrous, unmaintainable regex will waste your precious lifetime. If you seriously think it must be a regex, let us know why, so we can find a way out.
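A minimal sketch of trying several `SimpleDateFormat` patterns in turn might look like the following. It assumes the candidate date string has already been isolated from the surrounding text. The class name `MultiFormatDateParser`, the pattern list, and the choice of `Locale.ENGLISH` are illustrative assumptions, not something taken from the question.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
public class MultiFormatDateParser {
    // Assumed pattern list covering the formats mentioned in the question.
    private static final String[] PATTERNS = {
        "MMMM d, yyyy",   // January 15, 2013
        "d MMMM, yyyy",   // 15 January, 2013
        "d MMMM yyyy",    // 15 January 2013
        "MM/dd/yyyy",     // 01/15/2013
        "MM-dd-yyyy"      // 01-15-2013
    };

    // Returns the parsed date from the first pattern that fits, or null.
    static Date parse(String text) {
        for (String pattern : PATTERNS) {
            // A fresh instance per call, since SimpleDateFormat is not thread-safe.
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
            format.setLenient(false); // reject out-of-range values such as month 13
            try {
                return format.parse(text);
            } catch (ParseException e) {
                // This pattern did not fit; fall through to the next one.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(parse("15 January, 2013"));
        System.out.println(parse("January 2013")); // no day, so no pattern fits
    }
}
```

Adding a new format is then a one-line change to the pattern list instead of another branch in the regex.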