Given a set of Java regular expression patterns separated by an OR (i.e. | ), is there any specific precedence that the patterns will follow?

Example code:

    // Requires java.util.ArrayList, java.util.List, java.util.regex.Matcher and java.util.regex.Pattern.

    String []columnPatterns = new String[] { "(\\S\\s?)+", "(\\S\\s?)+",
                "(\\d+,?)+\\.\\d+ | \\d+:\\d+", "(\\S\\s?)+",
                "-?\\$?(\\d+,?)+\\.\\d+" };

    String searchString = "Text1            This is Text 2                                          129.80";

    int findFrom = 0;
    int columnIndex = 0;
    List<String> columnValues = new ArrayList<String>();
    for (String pattern : columnPatterns) {
        Pattern p = Pattern.compile(pattern);
        Matcher m = p.matcher(searchString);
        if (m.find(findFrom)) {
            columnValues.add(columnIndex++,
                    searchString.substring(m.start(), m.end()).trim());
            findFrom = m.end();
        }
    }

    for (String value : columnValues) {
        System.out.println("<" + value + ">");
    }

The above code yields the following result:

    <Text1>
    <This is Text 2>
    <129.80>

But if I change the pattern at index 2 of the columnPatterns array from "(\d+,?)+\.\d+ | \d+:\d+" to "(\d+,?)+\.\d+ | \d+:\d+ | \d+", as shown below:

    columnPatterns = new String[] { "(\\S\\s?)+", "(\\S\\s?)+",
                "(\\d+,?)+\\.\\d+ | \\d+:\\d+ | \\d+", "(\\S\\s?)+",
                "-?\\$?(\\d+,?)+\\.\\d+" };

I get the following result:

    <Text1>
    <This is Text 2>
    <129>
    <.80>

Does this mean some kind of implicit precedence is being applied, or is there another reason behind this? And what would be a solution or workaround for this behaviour?

Edit: Also, why does the code behave the way it does?

1 Answer

Given a set of Java regular expression patterns separated by an OR (i.e. | ), is there any specific precedence that the patterns will follow?

Left to right. At each position in the string, the alternatives are tried in order, and the first one that matches becomes the final match (unless the engine backtracks into the alternation later because the rest of the pattern fails).
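To illustrate the left-to-right rule, here is a minimal sketch. The class name AlternationOrderDemo and the helper firstMatch are made up for this example; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrderDemo {
    // Hypothetical helper: returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Both alternatives can match at index 0; the leftmost one is used,
        // even though the second would give a longer match.
        System.out.println(firstMatch("a|ab", "ab")); // prints "a"

        // Swapping the alternatives changes the result.
        System.out.println(firstMatch("ab|a", "ab")); // prints "ab"
    }
}
```

So the order of the alternatives only decides between alternatives that can match at the same starting position.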

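In your case, the last alternative matches first because it begins with a space: it can start matching at the space just before 129, a position where the earlier alternatives cannot match. The first alternative also ends with a space, so it cannot match 129.80 at the end of the string at all.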

For example, when matching the pattern \d| \d against the string foo 7, the second alternative matches first, at index 3 in the string. The first alternative cannot match at that position; it would only be able to match at index 4.
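The foo 7 example can be checked with a short sketch like the one below. The class name LeftmostMatchDemo and the helper firstMatchStart are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeftmostMatchDemo {
    // Hypothetical helper: returns the start index of the first match, or -1 if there is none.
    static int firstMatchStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // The space in " \d" is a literal character (no Pattern.COMMENTS flag is set),
        // so the second alternative can start at the space, one position earlier.
        System.out.println(firstMatchStart("\\d| \\d", "foo 7")); // prints 3

        // Without the space-prefixed alternative, the match starts at the digit.
        System.out.println(firstMatchStart("\\d", "foo 7")); // prints 4
    }
}
```

The engine tries every alternative at index 3 before it ever looks at index 4, which is why the earliest starting position wins over the position of the alternative in the pattern.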

3 Comments

I thought the same. But if the precedence is left to right, I still don't understand why my program behaved this way: for the pattern "(\\d+,?)+\\.\\d+ | \\d+:\\d+ | \\d+", the rightmost alternative, \\d+, matched before the leftmost one, (\\d+,?)+\\.\\d+, which is the opposite of what should have happened.
Thanks for pointing out the space; I would have missed that for sure. I don't have access to my computer right now to check whether removing the spaces solves the problem. But given that (\\S\\s?)+ matches "Text1", (\\S\\s?)+ matches "This is Text 2", and the first of the three alternatives in (\\d+,?)+\\.\\d+ | \\d+:\\d+ | \\d+, namely (\\d+,?)+\\.\\d+, matches "129.80", why is the behavior still not as expected?
I tried your suggestion. The spaces do make a difference in the behavior of the code. Thanks a lot!
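
For reference, a minimal sketch of the loop from the question with the spaces around | removed from the third pattern. The class name ColumnExtractorDemo and the method extractColumns are made up for this example; the matching logic follows the question's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColumnExtractorDemo {
    // Applies each pattern in turn, resuming the search where the previous match ended.
    static List<String> extractColumns(String[] patterns, String input) {
        List<String> values = new ArrayList<String>();
        int findFrom = 0;
        for (String pattern : patterns) {
            Matcher m = Pattern.compile(pattern).matcher(input);
            if (m.find(findFrom)) {
                values.add(m.group().trim());
                findFrom = m.end();
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Same patterns as in the question, but with no literal spaces around '|'.
        String[] columnPatterns = { "(\\S\\s?)+", "(\\S\\s?)+",
                "(\\d+,?)+\\.\\d+|\\d+:\\d+|\\d+", "(\\S\\s?)+",
                "-?\\$?(\\d+,?)+\\.\\d+" };
        String searchString = "Text1            This is Text 2                                          129.80";

        for (String value : extractColumns(columnPatterns, searchString)) {
            System.out.println("<" + value + ">");
        }
        // prints:
        // <Text1>
        // <This is Text 2>
        // <129.80>
    }
}
```

With the spaces removed, every alternative has to start at a digit, so the first alternative gets the chance to match 129.80 as a whole. If whitespace is wanted for readability, it would have to be enabled explicitly, for example with the Pattern.COMMENTS flag.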
