1

I am struggling with regex performance issue while parsing large text files.
I am using .NET 4.0 with the following code:

private static pattern =   
@"((\D|^)(19|20|)\d\d([- /.\\])(0[1-9]|1[012]|[1-9])\4(0[1-9]|[12][0-9]|3[01]|[0-9])  (\D|$))|" +
@"((\D|^)(19|20|)\d\d([- /.\\])(0[1-9]|[12][0-9]|3[01]|[0-9])\11(0[1-9]|1[012]|[0-9])  (\D|$))|" + 
@"((\D|^)(0[1-9]|1[012]|[0-9])([- /.\\])(0[1-9]|[12][0-9]|3[01]|[0-9])\18(19|20|)\d\d(\D|$))|" + 
@"((\D|^)(0[1-9]|[12][0-9]|3[01]|[0-9])([- /.\\])(0[1-9]|1[012]|[0-9])\25(19|20|)\d\d(\D|$))|" + 
@"((\D|^)(19|20|)\d\d(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])(\D|$))|" + 
@"((\D|^)(19|20|)\d\d(0[1-9]|[12][0-9]|3[01])(0[1-9]|1[012])(\D|$))|" + 
@"((\D|^)(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])(19|20|)\d\d(\D|$))|" + 
@"((\D|^)(0[1-9]|[12][0-9]|3[01])(0[1-9]|1[012])(19|20|)\d\d(\D|$))|" + 
@"((^|(?<!(\d[- /.\\\d])|\d))(19|20|)\d\d([- /.\\])(0[1-9]|1[012]|[1-9])([^- /.\\\d\w]|$|\s))|" + 
@"((^|(?<!(\d[- /.\\\d])|\d))(0[1-9]|1[012]|[1-9])([- /.\\])(19|20|)\d\d([^- /.\\\d\w]|$|\s))|" + 
@"((^|(?<!(\d[- /.\\\d])|\d))(0[1-9]|1[012]|[1-9])([- /.\\])(0[1-9]|[12][0-9]|3[01])([^- /.\\\d\w]|$|\s))|" + 
@"((^|(?<!(\d[- /.\\\d])|\d))(0[1-9]|[12][0-9]|3[01])([- /.\\])(0[1-9]|1[012]|[1-9])([^- /.\\\d\w]|$|\s))"; 

private static Regex dateRegex = new new Regex(pattern, 
    RegexOptions.Compiled | RegexOptions.IgnoreCase | 
    RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);

public static void Extract(string text)  
{
   foreach (Match match in dateRegex.Matches(text))
                Console.Writeline("Match {0}",match.Value);
}

The processing time for a 1MB text file which includes 200 matches is ~22secs.
Running the same regex with Java produce a much faster result: ~13Secs.
I managed to reduce the processing time of the .NET code by splitting the regex to parts and parallelize its execution.
Why Java is much faster processing this regex?
What can I do to improve the .NET performance processing this regex?

Cheers,
Doron

16
  • This probably doesn't make much of a difference, but have you tried simply executing dateRegex.Matched(text) without writing to console? Commented Dec 29, 2011 at 15:47
  • 1
    What exactly is the pattern matching? If possible, you should consider DFA based parser or a parser for a subset of context free languages. Ex fslex (and fsyacc) Commented Dec 29, 2011 at 15:50
  • Would it make sense to iterate through the string then apply your regex to the next ten (or so) characters each time you encounter a digit? Commented Dec 29, 2011 at 15:53
  • What happens if you remove the compile action, you only need that if you reuse the expression. Commented Dec 29, 2011 at 16:00
  • 1
    Splitting the regexp up and paralizing the execution could be a really good answer. What you currently have is an unreadable regular expression that tries to do all. Optimizing it will make it even less readable. For instance, you could try and find some "fingerprint" of a date, then check if it is really a date. Commented Dec 29, 2011 at 17:06

2 Answers 2

3

I don't have time to complete my analysis right now, but I will provide my progress so far. Here is your regex reformatted so that you can actually read it. The only change that I made was to wrap some spaces in a character class to allow for free-spacing mode. There are 80 capture groups (Yipes! - most of which appear to unnecessary). This expression appears to be matching various forms of a Date. There are many rooms for improvement:

private static pattern = @"
    # Match various forms of a Date.
      (                                     # Begin $1:
        (\D|^)                              # $2:
        (19|20|)\d\d                        # $3:
        ([- /.\\])                          # $4:
        (0[1-9]|1[012]|[1-9])               # $5:
        \4
        (0[1-9]|[12][0-9]|3[01]|[0-9])      # $6:
        [ ][ ]
        (\D|$)                              # $7:
      )                                     # End $1:
    | (                                     # Begin $8:
        (\D|^)                              # $9:
        (19|20|)\d\d                        # $10:
        ([- /.\\])                          # $11:
        (0[1-9]|[12][0-9]|3[01]|[0-9])      # $12:
        \11
        (0[1-9]|1[012]|[0-9])               # $13:
        [ ][ ]
        (\D|$)                              # $14:
      )                                     # End $8:
    | (                                     # Begin $15:
        (\D|^)                              # $16:
        (0[1-9]|1[012]|[0-9])               # $17:
        ([- /.\\])                          # $18:
        (0[1-9]|[12][0-9]|3[01]|[0-9])      # $19:
        \18
        (19|20|)\d\d                        # $20:
        (\D|$)                              # $21:
      )                                     # End $15:
    | (                                     # Begin $22:
        (\D|^)                              # $23:
        (0[1-9]|[12][0-9]|3[01]|[0-9])      # $24:
        ([- /.\\])                          # $25:
        (0[1-9]|1[012]|[0-9])               # $26:
        \25
        (19|20|)\d\d                        # $27:
        (\D|$)                              # $28:
      )                                     # End $22:
    | (                                     # Begin $29:
        (\D|^)                              # $30:
        (19|20|)\d\d                        # $31:
        (0[1-9]|1[012])                     # $32:
        (0[1-9]|[12][0-9]|3[01])            # $33:
        (\D|$)                              # $34:
      )                                     # End $29:
    | (                                     # Begin $35:
        (\D|^)                              # $36:
        (19|20|)\d\d                        # $37:
        (0[1-9]|[12][0-9]|3[01])            # $38:
        (0[1-9]|1[012])                     # $39:
        (\D|$)                              # $40:
      )                                     # End $35:
    | (                                     # Begin $41:
        (\D|^)                              # $42:
        (0[1-9]|1[012])                     # $43:
        (0[1-9]|[12][0-9]|3[01])            # $44:
        (19|20|)\d\d                        # $45:
        (\D|$)                              # $46:
      )                                     # End $41:
    | (                                     # Begin $47:
        (\D|^)                              # $48:
        (0[1-9]|[12][0-9]|3[01])            # $49:
        (0[1-9]|1[012])                     # $50:
        (19|20|)\d\d                        # $51:
        (\D|$)                              # $52:
      )                                     # End $47:
    | (                                     # Begin $53:
        ( ^                                 # Begin $54:
        | (?<!
            (\d[- /.\\\d])                  # $55:
          | \d
          )
        )                                   # End $54:
        (19|20|)\d\d                        # $56:
        ([- /.\\])                          # $57:
        (0[1-9]|1[012]|[1-9])               # $58:
        ([^- /.\\\d\w]|$|\s)                # $59:
      )                                     # End $53:
    | (                                     # Begin $60:
        ( ^                                 # Begin $61:
        | (?<!
            (\d[- /.\\\d])                  # $62:
          | \d
          )
        )                                   # End $61:
        (0[1-9]|1[012]|[1-9])               # $63:
        ([- /.\\])                          # $64:
        (19|20|)\d\d                        # $65:
        ([^- /.\\\d\w]|$|\s)                # $66:
      )                                     # End $60:
    | (                                     # Begin $67:
        ( ^                                 # Begin $68:
        | (?<!
            (\d[- /.\\\d])                  # $69:
          |\d
          )
        )                                   # End $68:
        (0[1-9]|1[012]|[1-9])               # $70:
        ([- /.\\])                          # $71:
        (0[1-9]|[12][0-9]|3[01])            # $72:
        ([^- /.\\\d\w]|$|\s))               # $73:
    | (                                     # Begin $74:
        ( ^                                 # Begin $75:
        | (?<!
            (\d[- /.\\\d])                  # $76:
          | \d
          )
        )                                   # End $75:
        (0[1-9]|[12][0-9]|3[01])            # $77:
        ([- /.\\])                          # $78:
        (0[1-9]|1[012]|[1-9])               # $79:
        ([^- /.\\\d\w]|$|\s)                # $80:
      )                                     # End $74:
    ";

When I get some more time, I'll update this answer with some recommended improvements. In the meantime, other regex experts please feel free to take this improved partially-commented version and run with it...

Sign up to request clarification or add additional context in comments.

Comments

1

There are different flavors of regex parsers out there and the java one may be more suited for the pattern you use. But your pattern an the options you have choosen really slow this one down!

RegexOptions.IgnoreCase (is this necessary?) See Want faster regular expressions? Maybe you should think about that IgnoreCase option...

RegexOptions.IgnorePatternWhitespace : You have ZERO pattern whitespace. This does not slow down the actual regex parsing, for it is only to allow the user to document the pattern. But it is telling because it shows that the creator of the pattern (no offense here really :-) ) does not understand what the regex parser needs to effectivly execute a pattern in a reasonable amounht of time; see next statement as to why its slow.

Your pattern is slow because of the ambiguity of the pattern causes too many backtracking issues; plain and simple.

See Backtracking for more information on what that means.

HTH

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.