Regex performance in .NET vs Java

Question

I am struggling with regex performance issue while parsing large text files.
I am using .NET 4.0 with the following code:

private static pattern =   
@"((\D|^)(19|20|)\d\d([- /.\\])(0[1-9]|1[012]|[1-9])\4(0[1-9]|[12][0-9]|3[01]|[0-9])  (\D|$))|" +
@"((\D|^)(19|20|)\d\d([- /.\\])(0[1-9]|[12][0-9]|3[01]|[0-9])\11(0[1-9]|1[012]|[0-9])  (\D|$))|" + 
@"((\D|^)(0[1-9]|1[012]|[0-9])([- /.\\])(0[1-9]|[12][0-9]|3[01]|[0-9])\18(19|20|)\d\d(\D|$))|" + 
@"((\D|^)(0[1-9]|[12][0-9]|3[01]|[0-9])([- /.\\])(0[1-9]|1[012]|[0-9])\25(19|20|)\d\d(\D|$))|" + 
@"((\D|^)(19|20|)\d\d(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])(\D|$))|" + 
@"((\D|^)(19|20|)\d\d(0[1-9]|[12][0-9]|3[01])(0[1-9]|1[012])(\D|$))|" + 
@"((\D|^)(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])(19|20|)\d\d(\D|$))|" + 
@"((\D|^)(0[1-9]|[12][0-9]|3[01])(0[1-9]|1[012])(19|20|)\d\d(\D|$))|" + 
@"((^|(?<!(\d[- /.\\\d])|\d))(19|20|)\d\d([- /.\\])(0[1-9]|1[012]|[1-9])([^- /.\\\d\w]|$|\s))|" + 
@"((^|(?<!(\d[- /.\\\d])|\d))(0[1-9]|1[012]|[1-9])([- /.\\])(19|20|)\d\d([^- /.\\\d\w]|$|\s))|" + 
@"((^|(?<!(\d[- /.\\\d])|\d))(0[1-9]|1[012]|[1-9])([- /.\\])(0[1-9]|[12][0-9]|3[01])([^- /.\\\d\w]|$|\s))|" + 
@"((^|(?<!(\d[- /.\\\d])|\d))(0[1-9]|[12][0-9]|3[01])([- /.\\])(0[1-9]|1[012]|[1-9])([^- /.\\\d\w]|$|\s))"; 

private static Regex dateRegex = new new Regex(pattern, 
    RegexOptions.Compiled | RegexOptions.IgnoreCase | 
    RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);

public static void Extract(string text)  
{
   foreach (Match match in dateRegex.Matches(text))
                Console.Writeline("Match {0}",match.Value);
}

The processing time for a 1MB text file which includes 200 matches is ~22secs.
Running the same regex with Java produce a much faster result: ~13Secs.
I managed to reduce the processing time of the .NET code by splitting the regex to parts and parallelize its execution.
Why Java is much faster processing this regex?
What can I do to improve the .NET performance processing this regex?

Cheers,
Doron

This probably doesn't make much of a difference, but have you tried simply executing dateRegex.Matched(text) without writing to console? — snurre
– snurre, Commented Dec 29, 2011 at 15:47
What exactly is the pattern matching? If possible, you should consider DFA based parser or a parser for a subset of context free languages. Ex fslex (and fsyacc) — Lasse Espeholt
– Lasse Espeholt, Commented Dec 29, 2011 at 15:50
Would it make sense to iterate through the string then apply your regex to the next ten (or so) characters each time you encounter a digit? — Bob Kaufman
– Bob Kaufman, Commented Dec 29, 2011 at 15:53
What happens if you remove the compile action, you only need that if you reuse the expression. — Peter
– Peter, Commented Dec 29, 2011 at 16:00
Splitting the regexp up and paralizing the execution could be a really good answer. What you currently have is an unreadable regular expression that tries to do all. Optimizing it will make it even less readable. For instance, you could try and find some "fingerprint" of a date, then check if it is really a date. — Maarten Bodewes
– Maarten Bodewes, Commented Dec 29, 2011 at 17:06

ridgerunner · Accepted Answer · 2011-12-29 19:08:52Z

I don't have time to complete my analysis right now, but I will provide my progress so far. Here is your regex reformatted so that you can actually read it. The only change that I made was to wrap some spaces in a character class to allow for free-spacing mode. There are 80 capture groups (Yipes! - most of which appear to unnecessary). This expression appears to be matching various forms of a Date. There are many rooms for improvement:

private static pattern = @"
    # Match various forms of a Date.
      (                                     # Begin $1:
        (\D|^)                              # $2:
        (19|20|)\d\d                        # $3:
        ([- /.\\])                          # $4:
        (0[1-9]|1[012]|[1-9])               # $5:
        \4
        (0[1-9]|[12][0-9]|3[01]|[0-9])      # $6:
        [ ][ ]
        (\D|$)                              # $7:
      )                                     # End $1:
    | (                                     # Begin $8:
        (\D|^)                              # $9:
        (19|20|)\d\d                        # $10:
        ([- /.\\])                          # $11:
        (0[1-9]|[12][0-9]|3[01]|[0-9])      # $12:
        \11
        (0[1-9]|1[012]|[0-9])               # $13:
        [ ][ ]
        (\D|$)                              # $14:
      )                                     # End $8:
    | (                                     # Begin $15:
        (\D|^)                              # $16:
        (0[1-9]|1[012]|[0-9])               # $17:
        ([- /.\\])                          # $18:
        (0[1-9]|[12][0-9]|3[01]|[0-9])      # $19:
        \18
        (19|20|)\d\d                        # $20:
        (\D|$)                              # $21:
      )                                     # End $15:
    | (                                     # Begin $22:
        (\D|^)                              # $23:
        (0[1-9]|[12][0-9]|3[01]|[0-9])      # $24:
        ([- /.\\])                          # $25:
        (0[1-9]|1[012]|[0-9])               # $26:
        \25
        (19|20|)\d\d                        # $27:
        (\D|$)                              # $28:
      )                                     # End $22:
    | (                                     # Begin $29:
        (\D|^)                              # $30:
        (19|20|)\d\d                        # $31:
        (0[1-9]|1[012])                     # $32:
        (0[1-9]|[12][0-9]|3[01])            # $33:
        (\D|$)                              # $34:
      )                                     # End $29:
    | (                                     # Begin $35:
        (\D|^)                              # $36:
        (19|20|)\d\d                        # $37:
        (0[1-9]|[12][0-9]|3[01])            # $38:
        (0[1-9]|1[012])                     # $39:
        (\D|$)                              # $40:
      )                                     # End $35:
    | (                                     # Begin $41:
        (\D|^)                              # $42:
        (0[1-9]|1[012])                     # $43:
        (0[1-9]|[12][0-9]|3[01])            # $44:
        (19|20|)\d\d                        # $45:
        (\D|$)                              # $46:
      )                                     # End $41:
    | (                                     # Begin $47:
        (\D|^)                              # $48:
        (0[1-9]|[12][0-9]|3[01])            # $49:
        (0[1-9]|1[012])                     # $50:
        (19|20|)\d\d                        # $51:
        (\D|$)                              # $52:
      )                                     # End $47:
    | (                                     # Begin $53:
        ( ^                                 # Begin $54:
        | (?<!
            (\d[- /.\\\d])                  # $55:
          | \d
          )
        )                                   # End $54:
        (19|20|)\d\d                        # $56:
        ([- /.\\])                          # $57:
        (0[1-9]|1[012]|[1-9])               # $58:
        ([^- /.\\\d\w]|$|\s)                # $59:
      )                                     # End $53:
    | (                                     # Begin $60:
        ( ^                                 # Begin $61:
        | (?<!
            (\d[- /.\\\d])                  # $62:
          | \d
          )
        )                                   # End $61:
        (0[1-9]|1[012]|[1-9])               # $63:
        ([- /.\\])                          # $64:
        (19|20|)\d\d                        # $65:
        ([^- /.\\\d\w]|$|\s)                # $66:
      )                                     # End $60:
    | (                                     # Begin $67:
        ( ^                                 # Begin $68:
        | (?<!
            (\d[- /.\\\d])                  # $69:
          |\d
          )
        )                                   # End $68:
        (0[1-9]|1[012]|[1-9])               # $70:
        ([- /.\\])                          # $71:
        (0[1-9]|[12][0-9]|3[01])            # $72:
        ([^- /.\\\d\w]|$|\s))               # $73:
    | (                                     # Begin $74:
        ( ^                                 # Begin $75:
        | (?<!
            (\d[- /.\\\d])                  # $76:
          | \d
          )
        )                                   # End $75:
        (0[1-9]|[12][0-9]|3[01])            # $77:
        ([- /.\\])                          # $78:
        (0[1-9]|1[012]|[1-9])               # $79:
        ([^- /.\\\d\w]|$|\s)                # $80:
      )                                     # End $74:
    ";

When I get some more time, I'll update this answer with some recommended improvements. In the meantime, other regex experts please feel free to take this improved partially-commented version and run with it...

ΩmegaMan · Accepted Answer · 2012-01-01 17:34:46Z

There are different flavors of regex parsers out there and the java one may be more suited for the pattern you use. But your pattern an the options you have choosen really slow this one down!

RegexOptions.IgnoreCase (is this necessary?) See Want faster regular expressions? Maybe you should think about that IgnoreCase option...

RegexOptions.IgnorePatternWhitespace : You have ZERO pattern whitespace. This does not slow down the actual regex parsing, for it is only to allow the user to document the pattern. But it is telling because it shows that the creator of the pattern (no offense here really :-) ) does not understand what the regex parser needs to effectivly execute a pattern in a reasonable amounht of time; see next statement as to why its slow.

Your pattern is slow because of the ambiguity of the pattern causes too many backtracking issues; plain and simple.

See Backtracking for more information on what that means.

HTH

Collectives™ on Stack Overflow

Regex performance in .NET vs Java

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related