0

I have this working reference Perl script. Its regex was copied from a Java snippet that isn't giving the expected results:

my $regex = '^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})(?:-([A-Z0-9]{4}))*-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$';
if ("A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A" =~ /$regex/)
{
    print "Matches 1=$1 2=$2 3=$3 4=$4\n";
}

This correctly outputs:

Matches 1=PROD 2=COMP 3=LOGL 4=00000000-0000-8033-0000-000200354F0A

Now the equivalent Java snippet:

private static final String NON_SYSTEM_TYPE_REGEX = "^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})(?:-([A-Z0-9]{4}))*-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$";
private static final Pattern NON_SYSTEM_TYPE_PATTERN = Pattern.compile(MutableUniqueIdentity.NON_SYSTEM_TYPE_REGEX);
    ...

final Matcher match = MutableUniqueIdentity.NON_SYSTEM_TYPE_PATTERN.matcher(uniqueIdentity);

The uniqueIdentity input is further back in the stack trace (in a unit test) and is this value:

final String id5CompactString = "A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A";

NOTE: The regex and uniqueIdentity values were copied into the Perl program from a debug session to check whether a different language produces a different result (which it did).

ADDITIONAL NOTE: The reason the non-capture group is there is to allow the third element in the string to be optional, so it has to deal with both of these:

   A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A
   A-PROD-COMP-00000000-0000-8033-0000-000200354F0A

My unit test fails in Java - the third match group, which should be LOGL, is in fact 0000.

Here is a screenshot of the debugger right after the regex match line above: [image: debugger screenshot]

You can see that the pattern matches, and you can verify that the input parameter (text) and the regex are the same as in the Perl script, yet the result is different!

So my question is: Why does match.group(3) have a value of 0000 (when it should have a value of LOGL), and how does that relate back to the regex and the string it is applied to?

In Perl it yields the correct result - LOGL.

Additional info: I have perused this page that highlights the differences between Perl and Java regex engines, and there doesn't appear to be anything applicable.

2
  • The JavaScript (ECMAScript) regex implementation doesn't match your pattern either: regex101.com/r/vqUUTU/1 (but the others do). Commented Jan 16, 2021 at 12:32
  • Does it not match at all, or does it give a different answer? The pattern matches in both Java and Perl; it just gives a different result. Commented Jan 16, 2021 at 12:33

2 Answers

1

Replace your regex with the following regex:

^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})-(?:([A-Z0-9]{4}))*-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$
This has been moved out----------^

I have moved - out of the non-capturing group.

Demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        final String NON_SYSTEM_TYPE_REGEX = "^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})-(?:([A-Z0-9]{4}))*-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$";
        final Pattern NON_SYSTEM_TYPE_PATTERN = Pattern.compile(NON_SYSTEM_TYPE_REGEX);
        String uniqueIdentity = "A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A";
        final Matcher match = NON_SYSTEM_TYPE_PATTERN.matcher(uniqueIdentity);

        if (match.find()) {
            System.out.printf("Matches 1=%s 2=%s 3=%s 4=%s%n", match.group(1), match.group(2), match.group(3),
                    match.group(4));
        }
    }
}

Output:

Matches 1=PROD 2=COMP 3=LOGL 4=00000000-0000-8033-0000-000200354F0A

Check the demo at regex101 as well.
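
One caveat worth checking: with the hyphen outside the group it becomes mandatory, so the shorter form without the third component should no longer match. The following is only a quick sketch to illustrate that; the class name MandatoryHyphenCheck and its matches helper are made up for this example and are not part of the original code.

```java
import java.util.regex.Pattern;

class MandatoryHyphenCheck {
    // Regex from this answer: the hyphen now sits outside the optional group,
    // so a second hyphen is always required before the UUID part.
    static final Pattern MOVED = Pattern.compile(
            "^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})-(?:([A-Z0-9]{4}))*"
            + "-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$");

    // Hypothetical helper: true when the whole input matches the pattern.
    static boolean matches(String input) {
        return MOVED.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Form with the optional third component: matches.
        System.out.println(matches("A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A")); // true
        // Form without the third component: does not match.
        System.out.println(matches("A-PROD-COMP-00000000-0000-8033-0000-000200354F0A"));      // false
    }
}
```

So this regex only fits if the third component is always present.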


2 Comments

My fault for not explaining the regex. The non-capture group allows for an optional sub-component, so: A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A and A-PROD-COMP-00000000-0000-8033-0000-000200354F0A should be matchable, which is why the - must stay in the non-capture group.
The irony is the tool you linked (good tool by the way)... regex101 ... when corrected back to my original regex, shows the correct result with my regex...!!!
0

Ok I've made it work, but I don't understand why.

The regex needs to be made non-greedy, so instead of:

^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})(?:-([A-Z0-9]{4}))*-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$

it needs to be:

^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})(?:-([A-Z0-9]{4}))*?-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$

(with the extra ? after the * of the non-capture group)
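
A possible way to look at it, offered as a sketch rather than a confirmed explanation: with a greedy *, the group first consumes -LOGL and then also -0000 (the first four characters of the UUID), and the engine has to backtrack out of that second iteration before the UUID part can match. That would be consistent with Java leaving group 3 at the value from the abandoned iteration, though that is an assumption I have not checked against the JDK source. With *?, the engine tries the UUID part before each extra iteration, so the group never gets to capture 0000 in the first place. The class name LazyQuantifierDemo and the thirdGroup helper below are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyQuantifierDemo {
    // Regex from this answer: the optional group is repeated lazily (*?).
    static final Pattern LAZY = Pattern.compile(
            "^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})(?:-([A-Z0-9]{4}))*?"
            + "-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$");

    // Hypothetical helper: returns group 3, or null when the optional
    // component is absent. Throws if the input does not match at all.
    static String thirdGroup(String input) {
        Matcher m = LAZY.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("No match: " + input);
        }
        return m.group(3);
    }

    public static void main(String[] args) {
        // With the optional third component.
        System.out.println(thirdGroup("A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A")); // LOGL
        // Without it: the group did not participate, so group(3) is null.
        System.out.println(thirdGroup("A-PROD-COMP-00000000-0000-8033-0000-000200354F0A"));      // null
    }
}
```

Both forms from the question match with this pattern, and group 3 is null when the third component is missing, so callers need a null check.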

6 Comments

@ikegami this is my point - I don't understand why the greediness modifier has made a difference in this particular example. And to your first comment - I can't see what a "greedy" match would match differently and still have a valid result...
Also, the final clincher for me that something isn't right here is that both Perl and the regex101 website work perfectly without the non-greedy modifier, so please explain why the greediness makes sense.
Surely that's the point of Stack Overflow - to talk in the context of the question, not in isolation of a point, itself taken out of context.
SO isn't about talking at all. SO is not a discussion forum. I identified problems with the Answer. And then I fixed them. I figured you'd want an explanation. I have deleted the comments.
There were no problems with the answer. You deleted comments that were my thought processes about why the answer was illogical. You deleted your comments saying you didn't read the question, and you deleted my comments asking regex experts to share an explanation. So let's please do that: ask a regex expert to share their explanation, if they have one. That is the most constructive way forwards.