I am struggling with regex performance issue while parsing large text files.
I am using .NET 4.0 with the following code:
private static pattern =
@"((\D|^)(19|20|)\d\d([- /.\\])(0[1-9]|1[012]|[1-9])\4(0[1-9]|[12][0-9]|3[01]|[0-9]) (\D|$))|" +
@"((\D|^)(19|20|)\d\d([- /.\\])(0[1-9]|[12][0-9]|3[01]|[0-9])\11(0[1-9]|1[012]|[0-9]) (\D|$))|" +
@"((\D|^)(0[1-9]|1[012]|[0-9])([- /.\\])(0[1-9]|[12][0-9]|3[01]|[0-9])\18(19|20|)\d\d(\D|$))|" +
@"((\D|^)(0[1-9]|[12][0-9]|3[01]|[0-9])([- /.\\])(0[1-9]|1[012]|[0-9])\25(19|20|)\d\d(\D|$))|" +
@"((\D|^)(19|20|)\d\d(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])(\D|$))|" +
@"((\D|^)(19|20|)\d\d(0[1-9]|[12][0-9]|3[01])(0[1-9]|1[012])(\D|$))|" +
@"((\D|^)(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])(19|20|)\d\d(\D|$))|" +
@"((\D|^)(0[1-9]|[12][0-9]|3[01])(0[1-9]|1[012])(19|20|)\d\d(\D|$))|" +
@"((^|(?<!(\d[- /.\\\d])|\d))(19|20|)\d\d([- /.\\])(0[1-9]|1[012]|[1-9])([^- /.\\\d\w]|$|\s))|" +
@"((^|(?<!(\d[- /.\\\d])|\d))(0[1-9]|1[012]|[1-9])([- /.\\])(19|20|)\d\d([^- /.\\\d\w]|$|\s))|" +
@"((^|(?<!(\d[- /.\\\d])|\d))(0[1-9]|1[012]|[1-9])([- /.\\])(0[1-9]|[12][0-9]|3[01])([^- /.\\\d\w]|$|\s))|" +
@"((^|(?<!(\d[- /.\\\d])|\d))(0[1-9]|[12][0-9]|3[01])([- /.\\])(0[1-9]|1[012]|[1-9])([^- /.\\\d\w]|$|\s))";
private static Regex dateRegex = new new Regex(pattern,
RegexOptions.Compiled | RegexOptions.IgnoreCase |
RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
public static void Extract(string text)
{
foreach (Match match in dateRegex.Matches(text))
Console.Writeline("Match {0}",match.Value);
}
The processing time for a 1MB text file which includes 200 matches is ~22secs.
Running the same regex with Java produce a much faster result: ~13Secs.
I managed to reduce the processing time of the .NET code by splitting the regex to parts and parallelize its execution.
Why Java is much faster processing this regex?
What can I do to improve the .NET performance processing this regex?
Cheers,
Doron
dateRegex.Matched(text)without writing to console?