9

This article says that regexp matching in Java is slow because regexps with "back references" cannot be matched efficiently. The article explains efficient Thomson's NFA-based matching algorithm (invented in 1968) which works for regexps without "back references". However the Pattern javadoc says Java regexps use NFA-based approach.

Now I wonder how efficient Java regexp matching is and what algorithm it uses.

8
  • 1
    As usual: be careful with 'slow' and 'efficient'. If it's fast enough for your use case, use it, otherwise replace it. And the easiest (but maybe not the best way) to compare the speed to other solutions is a always: benchmark it. Commented Oct 8, 2013 at 15:06
  • That article was written in 2007 when, IIRC, JDK5 was being replaced by 6. The JDK 5 docs seem to say the same thing re use of NFAs. Commented Oct 8, 2013 at 15:08
  • 3
    Java regex uses NFA. It is very efficient. The specific problem described in the article is called catastrophic backtracking, this can be avoided by writing patterns sensibly. Commented Oct 8, 2013 at 15:13
  • Like Neet is saying, avoid premature optimizations. Regex is pretty fast in Java, do you need it faster? Or is this merely your curiosity? Commented Oct 8, 2013 at 20:09
  • 2
    I love such questions :) +1 Commented Oct 9, 2013 at 15:03

1 Answer 1

1

java.util.regex.Pattern uses Boyer–Moore string search algorithm

/* Attempts to match a slice in the input using the Boyer-Moore string
 * matching algorithm. The algorithm is based on the idea that the
 * pattern can be shifted farther ahead in the search text if it is
 * matched right to left.
 */

private void compile() {
    ----------------------
    -----------------------

   if (matchRoot instanceof Slice) {
        root = BnM.optimize(matchRoot);
        if (root == matchRoot) {
            root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
        }
    } else if (matchRoot instanceof Begin || matchRoot instanceof First) {
        root = matchRoot;
    } else {
        root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
    }
}
Sign up to request clarification or add additional context in comments.

1 Comment

This is technical correct but misleading: Pattern does use Boyer-Moore, but for the special case of matching a sequence (not for regexp in general), since Boyer-Moore is a substring matching algorithm and that's all it can do. It can't match e.g. the choice or kleene closure (star) of regex. The comment in the above source says match a slice - looking at the rest of the source, a Slice does indeed seem to be a sequence of literals: Pattern.java

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.