12

Trying to write a regex matcher for roman numerals. In sed (which I think is considered 'standard' for regex?), if you have multiple options delimited by the alternation operator, it will match the longest. Namely, "I|II|III|IV" will match "IV" for "IV" and "III" for "III"

In Java, the same pattern matches "I" for "IV" and "I" for "III". Turns out Java chooses between alternation matches left-to-right; that is, because "I" appears before "III" in the regex, it matches. If I change the regex to "IV|III|II|I", the behavior is corrected, but this obviously isn't a solution in general.

Is there a way to make Java choose the longest match out of an alternation group, instead of choosing the 'first'?

A code sample for clarity:

public static void main(String[] args)
{
    Pattern p = Pattern.compile("six|sixty");
    Matcher m = p.matcher("The year was nineteen sixty five.");
    if (m.find())
    {
        System.out.println(m.group());
    }
    else
    {
        System.out.println("wtf?");
    }
}

This outputs "six"

1
  • Where there might be a (not-so-good) way. Separate the above pattern into separate patterns and perform the adjudication yourself. Commented Oct 20, 2016 at 16:05

2 Answers 2

20

No, it's behaving correctly. Java uses an NFA, or regex-directed flavor, like Perl, .NET, JavaScript, etc., and unlike sed, grep, or awk. An alternation is expected to quit as soon as one of the alternatives matches, not hold out for the longest match.

You can force it to continue by adding a condition after the alternation that can't be met until the whole token has been consumed. What that condition might be depends on the context; the simplest option would be an anchor ($) or a word boundary (\b).

"\\b(I|II|III|IV)\\b"

EDIT: I should mention that, while grep, sed, awk and others traditionally use text-directed (or DFA) engines, you can also find versions of some of them that use NFA engines, or even hybrids of the two.

Sign up to request clarification or add additional context in comments.

7 Comments

+1 for pointing out that this is an NFA vs DFA thing. What does "regex-directed flavor" mean, though? Another option for the OP would be to arrange the alternatives such that none are preceded by any of their prefixes. For literal patterns the easiest way to to this is just to sort by descending size.
Thanks much. I'm disappointed I can't force Java to default to longest-match, but quite satisfied with the explanation of why, and the workarounds you both provided.
@Laurence: There's a brief description of regex-directed vs text-directed engines here: regular-expressions.info/engine.html ...and Friedl covers it very thoroughly in The Book: oreilly.com/catalog/regex3/index.html
@Laurence: And yes, sorting alternatives by descending length is a good practice. I would also recommend not using alternation at all if there's another reasonable option. For example, using six(?:ty)? instead of six|sixty or sixty|six.
@Alan: Thanks. I know the NFA versus DFA distinction, I'd just never heard these "regex-directed" and "text-directed" terms used for them before. Do you know where these terms originate?
|
4

I think a pattern that will work is something like

IV|I{1,3}

See the "greedy quantifiers" section at http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html

Edit: in response to your comment, I think the general problem is that you keep using alternation when it is not the right thing to use. In your new example, you are trying to match "six" or "sixty"; the right pattern to use is six(ty)?, not six|sixty. In general, if you ever have two members of an alternation group such that one is a prefix of another, you should rewrite the regular expression to eliminate it. Otherwise, you can't really complain that the engine is doing the wrong thing, since the semantics of alternation don't say anything about a longest match.

Edit 2: the literal answer to your question is no, it can't be forced (and my commentary is that you shouldn't ever need this behavior).

Edit 3: thinking more about the subject, it occurred to me that an alternation pattern where one string is the prefix of another is undesirable for another reason; namely, it will be slower unless the underlying automaton is constructed to take prefixes into account (and given that Java picks the first match in the pattern, I would guess that this is not the case).

9 Comments

So that fixes this particular case, but not the behavior in general. See the code sample I just added.
This is because you keep using alternation when you don't really want alternation. I will update my answer to explain in more detail.
The "standard behavior" you're talking about is a particular standard (POSIX) - to my knowledge this is the only standard that requires that alternation return the longest match. Given that there is nothing particular about alternation that suggests that the longest match is the correct one, there isn't any reason to expect that an arbitrary matching algorithm should choose the longest match.
Given the regexp "six|sixty", if the rule is not "longest match", it's hard to see what the rule would be. If the rule were "shorter match" or "first listed", "six|sixty" would be identical to "six", which is just foolish. What's left?
@danben -- I hope it was obvious I was criticizing the creator not the creation.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.