I'm writing code to parse dates out of a very large data set. I have the following regex to match different variations of dates:

"(((0?[1-9]|1[012])(/|-)(0?[1-9]|[12][0-9]|3[01])(/|-))|"
 +"((january|february|march|april|may|june|july|august|september|october|november|december)"
 + "\\s*(0?[1-9]|[12][0-9]|3[01])(th|rd|nd|st)?,*\\s*))((19|20)\\d\\d)"

which matches dates of the formats 'Month dd, yyyy', 'mm/dd/yyyy', and 'mm-dd-yyyy'. This works fine for those formats, but I'm now encountering dates in the European 'dd Month, yyyy' format. I tried adding (\\d{1,2})? at the beginning of the regex and adding a ? quantifier after the current day-matching section of the regex, like so:

"((\\d{1,2})?((0?[1-9]|1[012])(/|-)(0?[1-9]|[12][0-9]|3[01])(/|-))|"
 +"((january|february|march|april|may|june|july|august|september|october|november|december)"
 + "\\s*(0?[1-9]|[12][0-9]|3[01])?(th|rd|nd|st)?,*\\s*))((19|20)\\d\\d)"

but this is not entirely viable, as it sometimes captures numeric characters both before and after the month (e.g. '00 January 15, 2013') and sometimes neither ('January 2013'). Is there a way to ensure that exactly one of the two is captured?

  • Take a look at SimpleDateFormat. Commented Jul 23, 2014 at 17:20
  • Actually, SimpleDateFormat is probably not rigid enough. I'd use Joda DateTimeFormatter instead. Commented Jul 23, 2014 at 17:23
  • If you know where to expect your dates, clearly use a bunch of SimpleDateFormats. Getting this right in one ugly, monstrous, unmaintainable regex will waste your precious lifetime. If you seriously think it must be regex, let us know why so we can find a way out. Commented Jul 23, 2014 at 19:46
  • Is it just me, or does this question pop up about once a day? Commented Aug 2, 2014 at 2:40
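
The parser-based approach suggested in the comments above could look roughly like the sketch below. It is not taken from the thread: the class name `StrictDateParser`, the method `parseOrNull`, and the list of patterns are assumptions chosen for illustration. Each candidate string is tried against several non-lenient `SimpleDateFormat` patterns, so impossible dates such as '02/30/2013' are rejected, and a string only counts as a date if the whole of it was consumed.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

final class StrictDateParser {
    // Hypothetical pattern list covering the formats named in the question.
    private static final String[] PATTERNS = {
        "MM/dd/yyyy", "MM-dd-yyyy", "MMMM d, yyyy", "d MMMM, yyyy"
    };

    // Returns the parsed date, or null if no pattern matches the whole string.
    static Date parseOrNull(String candidate) {
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
            format.setLenient(false); // reject out-of-range fields such as February 30
            ParsePosition position = new ParsePosition(0);
            Date parsed = format.parse(candidate, position);
            // Accept only if parsing succeeded and consumed every character.
            if (parsed != null && position.getIndex() == candidate.length()) {
                return parsed;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(parseOrNull("15 January, 2013") != null); // true
        System.out.println(parseOrNull("January 2013") != null);     // false
        System.out.println(parseOrNull("02/30/2013") != null);       // false
    }
}
```

This sketch validates a string that has already been isolated; it does not search free text, and ordinal suffixes such as '26th' would have to be stripped before parsing.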

1 Answer

Here is one Java implementation for your requirements (searching for dates in input text):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class DateSearch {
        public static void main(String[] args) {
            // Sample text containing each of the supported date formats.
            String input = "which matches dates of format 'january 31, 1976', '9/18/2013', "
                    + "and '11-20-1988'. This works fine for those formats, but I'm now encountering dates "
                    + "in the European '26th May, 2020' format. I tried adding (\\d{1,2})? at the "
                    + "beginning of the regex and adding a ? quantifier after the current day matching "
                    + "section of the regex as such";

            // Building blocks: month name, numeric month, day of month, four-digit year.
            String months_t = "(january|february|march|april|may|june|july|august|september|october|november|december)";
            String months_d = "(1[012]|0?[1-9])";
            String days_d = "(3[01]|[12][0-9]|0?[1-9])";
            String year_d = "((19|20)\\d\\d)";
            // Day with an optional ordinal suffix, e.g. '26th'.
            String days_d_a = "(" + days_d + "(th|rd|nd|st)?)";

            // 'mm/dd/yyyy' and 'mm-dd-yyyy'
            String regexp1 = "(" + months_d + "[/-]" + days_d + "[/-]"
                    + year_d + ")";
            // 'Month dd, yyyy' and 'dd Month, yyyy': the day appears exactly once,
            // either after or before the month name
            String regexp2 = "(((" + months_t + "\\s*" + days_d_a + ")|("
                    + days_d_a + "\\s*" + months_t + "))[,\\s]+" + year_d + ")";
            // (?i) makes the whole pattern case-insensitive.
            String regexp = "(?i)" + regexp1 + "|" + regexp2;

            Pattern pMod = Pattern.compile(regexp);
            Matcher mMod = pMod.matcher(input);

            // Print every date found in the input.
            while (mMod.find()) {
                System.out.println(mMod.group(0));
            }
        }
    }

The output is:

january 31, 1976
9/18/2013
11-20-1988
26th May, 2020
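
To check the "exactly one day" behaviour against the two problem inputs from the question, the same alternation can be wrapped in a small helper. This is only a sketch: the class name `DateFinder` and the method `findDates` are made up for illustration, and the groups are written as non-capturing `(?:...)` since only the whole match is used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DateFinder {
    private static final String MONTH_NAME =
        "(?:january|february|march|april|may|june|july|august|september|october|november|december)";
    private static final String MONTH_NUM = "(?:1[012]|0?[1-9])";
    private static final String DAY = "(?:3[01]|[12][0-9]|0?[1-9])";
    private static final String YEAR = "(?:(?:19|20)\\d\\d)";
    private static final String DAY_ORDINAL = DAY + "(?:th|rd|nd|st)?";

    // Either a numeric date, or a month name with the day on exactly one side.
    private static final Pattern DATE = Pattern.compile(
        "(?i)" + MONTH_NUM + "[/-]" + DAY + "[/-]" + YEAR
        + "|(?:" + MONTH_NAME + "\\s*" + DAY_ORDINAL
        + "|" + DAY_ORDINAL + "\\s*" + MONTH_NAME + ")[,\\s]+" + YEAR);

    // Collects every date-like substring found in the text, in order.
    static List<String> findDates(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = DATE.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findDates("January 2013"));        // []
        System.out.println(findDates("00 January 15, 2013")); // [January 15, 2013]
        System.out.println(findDates("15 January, 2013"));    // [15 January, 2013]
    }
}
```

Because the day is mandatory inside each branch of the alternation, 'January 2013' produces no match, and in '00 January 15, 2013' the leading '00' is skipped because it is not a valid day.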