Regular expression matching algorithm in Java

Question

This article says that regexp matching in Java is slow because regexps with "back references" cannot be matched efficiently. The article explains efficient Thomson's NFA-based matching algorithm (invented in 1968) which works for regexps without "back references". However the Pattern javadoc says Java regexps use NFA-based approach.

Now I wonder how efficient Java regexp matching is and what algorithm it uses.

As usual: be careful with 'slow' and 'efficient'. If it's fast enough for your use case, use it, otherwise replace it. And the easiest (but maybe not the best way) to compare the speed to other solutions is a always: benchmark it. — Neet
– Neet, Commented Oct 8, 2013 at 15:06
That article was written in 2007 when, IIRC, JDK5 was being replaced by 6. The JDK 5 docs seem to say the same thing re use of NFAs. — Mike Samuel
– Mike Samuel, Commented Oct 8, 2013 at 15:08
Java regex uses NFA. It is very efficient. The specific problem described in the article is called catastrophic backtracking, this can be avoided by writing patterns sensibly. — Boris the Spider
– Boris the Spider, Commented Oct 8, 2013 at 15:13
Like Neet is saying, avoid premature optimizations. Regex is pretty fast in Java, do you need it faster? Or is this merely your curiosity? — telkins
– telkins, Commented Oct 8, 2013 at 20:09

Alan Moore · Accepted Answer · 2013-10-08 20:05:39Z

1

java.util.regex.Pattern uses Boyer–Moore string search algorithm

/* Attempts to match a slice in the input using the Boyer-Moore string
 * matching algorithm. The algorithm is based on the idea that the
 * pattern can be shifted farther ahead in the search text if it is
 * matched right to left.
 */

private void compile() {
    ----------------------
    -----------------------

   if (matchRoot instanceof Slice) {
        root = BnM.optimize(matchRoot);
        if (root == matchRoot) {
            root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
        }
    } else if (matchRoot instanceof Begin || matchRoot instanceof First) {
        root = matchRoot;
    } else {
        root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
    }
}

edited Oct 8, 2013 at 20:05

Alan Moore

75.6k13 gold badges110 silver badges161 bronze badges

answered Oct 8, 2013 at 15:32

Prabhakaran Ramaswamy

26.1k10 gold badges61 silver badges65 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

hyperpallium Over a year ago

This is technical correct but misleading: Pattern does use Boyer-Moore, but for the special case of matching a sequence (not for regexp in general), since Boyer-Moore is a substring matching algorithm and that's all it can do. It can't match e.g. the choice or kleene closure (star) of regex. The comment in the above source says match a slice - looking at the rest of the source, a Slice does indeed seem to be a sequence of literals: Pattern.java

Collectives™ on Stack Overflow

Regular expression matching algorithm in Java

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related