public static void main(String[] args) {

        Pattern compile = Pattern
                .compile("[0-9]{1,}[A-Za-z]{1,}|[A-Za-z][0-9]{1,}|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|[0-9][0-9\\-]{4,}|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+");
        Matcher matcher = compile.matcher("i5-2450M");
        matcher.find();
        System.out.println(matcher.group(0));
    }

I expected this to return i5-2450M, but it actually returns i5.

  • You could include word boundaries in your match. Commented Aug 21, 2012 at 4:50
  • The limit with regex is more often determined by the limitations of the developer, i.e. how much you can easily understand. If you read this code in six months' time, how much will be obvious to you? Commented Aug 21, 2012 at 7:55

2 Answers


The problem is that the first alternation that matches is used.

In this case the 2nd alternation ([A-Za-z][0-9]{1,}, which matches i5) "shadows" any following alternation.

// doesn't match
[0-9]{1,}[A-Za-z]{1,}|
// matches "i5"
[A-Za-z][0-9]{1,}|
// the following are never even checked, because of the previous match
[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}|
[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|
[0-9][0-9\\-]{4,}|
[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+

(Please note that the regular expression in the post likely has other serious issues. For instance, 0---# would be matched by the last rule. These should be addressed, but they are not covered below because they are not the "fundamental" problem, which is the alternation behavior.)

To fix this issue, order the alternatives from most specific to least specific. In this case, that means moving the 2nd alternative below all the other alternatives. (Also review the other alternatives and how they interact; perhaps the entire regular expression can be simplified?)
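As a minimal sketch of the reordering described above, the following keeps the six alternatives from the question and only moves the short `[A-Za-z][0-9]{1,}` one to the end. The class name `ReorderedAlternation` and the helper `firstMatch` are made up for this illustration; the other issues with the expression are left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReorderedAlternation {
    // Same alternatives as in the question. Only the order differs:
    // "[A-Za-z][0-9]{1,}" (formerly 2nd) is now tried last.
    static final Pattern REORDERED = Pattern.compile(
            "[0-9]{1,}[A-Za-z]{1,}"
            + "|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}"
            + "|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*"
            + "|[0-9][0-9\\-]{4,}"
            + "|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+"
            + "|[A-Za-z][0-9]{1,}");

    // Hypothetical helper: returns the first match, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = REORDERED.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("i5-2450M")); // prints "i5-2450M"
        System.out.println(firstMatch("i5"));       // prints "i5"
    }
}
```

The short alternative still matches a bare `i5`, because the longer alternatives fail on that input and the engine falls through to the last one.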

A simple word boundary (\b) will not work here because - is considered a non-word character, so there is a word boundary between 5 and -. However, depending upon the meaning of the regular expression, anchors (^ and $) could be used around the alternation. Because alternation has the lowest precedence, the alternatives must be grouped first: e.g. ^(?:existing_regex)$. This doesn't change the behavior of the alternation, but it forces the engine to backtrack out of the initial match of i5, because the end of input does not immediately follow the group, and so the subsequent alternatives are considered.
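The anchoring idea can be sketched as follows, assuming the whole input is expected to be a single token. The alternatives are unchanged and in the question's original order; the class and method names are invented for the example. `Matcher.matches()` is shown as well, since it requires the entire input to match and so has the same effect without explicit anchors.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredAlternation {
    // The expression from the question, unchanged and in its original order.
    static final String ORIGINAL =
            "[0-9]{1,}[A-Za-z]{1,}|[A-Za-z][0-9]{1,}"
            + "|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}"
            + "|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*"
            + "|[0-9][0-9\\-]{4,}"
            + "|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+";

    // The non-capturing group keeps ^ and $ from binding only to the
    // first and last alternative.
    static final Pattern ANCHORED = Pattern.compile("^(?:" + ORIGINAL + ")$");

    // Hypothetical helper: the anchored match, or null if the input
    // as a whole is not a token.
    static String anchoredMatch(String input) {
        Matcher m = ANCHORED.matcher(input);
        return m.find() ? m.group() : null;
    }

    // matches() only succeeds if the entire input is matched.
    static boolean matchesWhole(String input) {
        return Pattern.compile(ORIGINAL).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(anchoredMatch("i5-2450M"));  // prints "i5-2450M"
        System.out.println(matchesWhole("i5-2450M"));   // prints "true"
        System.out.println(anchoredMatch("i5 laptop")); // prints "null"
    }
}
```

This only helps when the token is the whole input; for tokens embedded in longer text, reordering the alternatives is the more direct fix.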


From Java regex alternation operator "|" behavior seems broken:

Java uses an NFA, or regex-directed flavor, like Perl, .NET, JavaScript, etc., and unlike sed, grep, or awk. An alternation is expected to quit as soon as one of the alternatives matches, not hold out for the longest match.

(The accepted answer in this question uses word boundaries.)

From Pattern:

The Pattern engine performs traditional NFA-based matching with ordered alternation as occurs in Perl 5.
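Ordered alternation is easiest to see on a two-alternative pattern. This small sketch (class and method names are illustrative only) runs `a|ab` and `ab|a` against the same input:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrderedAlternationDemo {
    // Hypothetical helper: first match of regex in input, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The leftmost alternative that succeeds wins, even if a later
        // alternative would have matched more text.
        System.out.println(firstMatch("a|ab", "ab")); // prints "a"
        System.out.println(firstMatch("ab|a", "ab")); // prints "ab"
    }
}
```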


2 Comments

Yes, that's all true, but it still doesn't solve my problem. You only tell me why, not how.
@ruby-boy Also consider that such a general regular expression approach may or may not be .. ideal .. based on exact goals/requirements. Here is an incomplete list of just Intel process nomenclatures.

Try iterating over the matches (i.e. while (matcher.find()), reusing the same Matcher instance).
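A sketch of what that iteration might look like with the question's unchanged pattern follows; the class name and the `allMatches` helper are hypothetical. The same `Matcher` is reused so that each `find()` call resumes after the previous match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IterateMatches {
    // The expression from the question, unchanged.
    static final Pattern ORIGINAL = Pattern.compile(
            "[0-9]{1,}[A-Za-z]{1,}|[A-Za-z][0-9]{1,}"
            + "|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}"
            + "|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*"
            + "|[0-9][0-9\\-]{4,}"
            + "|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+");

    // Hypothetical helper: collects every successive match in input.
    static List<String> allMatches(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = ORIGINAL.matcher(input); // one Matcher, created once
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(allMatches("i5-2450M")); // prints "[i5, 2450M]"
    }
}
```

Note that with the original ordering this yields the pieces `i5` and `2450M` rather than the single token `i5-2450M`, so iterating on its own does not produce the result the question asks for.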
