16

The following prints ac | a | bbb | c

#!/usr/bin/env perl
use strict;
use warnings;
# use re 'debug';
    
my $str = 'aacbbbcac';
    
if ($str =~ m/((a+)?(b+)?(c))*/) {
    print "$1 | $2 | $3 | $4\n";
}

It seems like failed matches do not reset the captured group variables. What am I missing?

  • What output do you expect? Commented Oct 28, 2013 at 18:20
  • @ikegami I know, it's not my pattern. I ran into this on the G+ Perl Community and was wondering about it. Commented Oct 28, 2013 at 18:29
  • Besides, failed matches don't reset capture variables. perl -E'"a"=~/(.)/; "b"=~/(..)/; say $1;' Commented Oct 28, 2013 at 18:33

3 Answers

23

It seems like failed matches do not reset the captured group variables.

There are no failed matches here. Your regex matches the string fine, although some of the inner groups fail to match in some repetitions. On each repetition, a group that matches overwrites its previous value, while a group that does not match keeps its value from the earlier repetition.

Let's see how regex match proceeds:

  • First, (a+)?(b+)?(c) matches aac. No b follows aa, and since (b+)? is optional, it matches nothing. At this stage, the capture groups contain:

    • $1 contains entire match - aac
    • $2 contains (a+)? part - aa
    • $3 contains (b+)? part - nothing (undefined).
    • $4 contains (c) part - c
  • The string still has bbbcac left to match, so the group repeats: (a+)?(b+)?(c) matches bbbc. The remaining text starts with b, so the optional (a+)? matches nothing.

    • $1 contains entire match - bbbc. Overwrites the previous value in $1
    • $2 doesn't match, so it keeps the text it matched previously - aa
    • $3 this time matches. It contains - bbb
    • $4 matches c
  • Again, (a+)?(b+)?(c) will go on to match the last part - ac.

    • $1 contains entire match - ac.
    • $2 matches a this time. Overwrites the previous value in $2. It now contains - a
    • $3 doesn't match this time, as there is no b in ac. It keeps its value from the previous repetition - bbb
    • $4 matches c. Overwrites the value from previous match. It now contains - c.

Now, there is nothing left in the string to match. The final values of all the capture groups are:

  • $1 - ac
  • $2 - a
  • $3 - bbb
  • $4 - c.
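
One way to check this walk-through is to ask Perl where each final capture sits in the string. The following is a minimal sketch using the built-in @- and @+ offset arrays; the %span hash is just an illustrative name, not part of the original code.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $str = 'aacbbbcac';
my %span;    # illustrative: group number => "start-end: text"

if ($str =~ m/((a+)?(b+)?(c))*/) {
    for my $n (1 .. 4) {
        # $-[$n] and $+[$n] are the start and end offsets of group $n
        next unless defined $-[$n];
        my $text = substr($str, $-[$n], $+[$n] - $-[$n]);
        $span{$n} = sprintf '%d-%d: %s', $-[$n], $+[$n], $text;
        printf "\$%d = %s\n", $n, $span{$n};
    }
}
```

The offsets for $3 (3 to 6) lie before those of $1 (7 to 9), which shows that $3 was left over from the second repetition rather than captured during the last one.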

3

As odd as it seems, this is the "expected" behavior. Here's a quote from the perlre docs:

NOTE: Failed matches in Perl do not reset the match variables, which makes it easier to write code that tests for a series of more specific cases and remembers the best match.
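
A small sketch of what that note means in practice, based on the one-liner from the comments above. The variable names (@seen, $guarded) are illustrative only.

```perl
use strict;
use warnings;

my @seen;    # illustrative log of what $1 holds after each attempt

my $first = ('a' =~ /(.)/);      # succeeds, $1 is 'a'
push @seen, $1;

my $second = ('b' =~ /(..)/);    # fails: 'b' is only one character
push @seen, $1;                  # $1 still holds 'a' from the earlier success

# Safer: read $1 only when this particular match succeeded
my $guarded = ('b' =~ /(..)/) ? $1 : 'no match';

print "@seen | $guarded\n";      # prints "a a | no match"
```

Testing the result of the match before reading $1 avoids picking up a value left behind by an earlier successful match.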


0

I know that you are looking at this academically, but in general, a quantifier at the end of a match is often a code smell. Not only that, a zero-or-more quantifier is even smellier. That pattern will never fail to match because it can always match the zero-times case:

use strict;
use warnings;

my $str = 'xzy';

if ($str =~ m/((a+)?(b+)?(c))*/) {
    print "matched: $1 | $2 | $3 | $4\n";
}

Gives some warnings, but still matches:

matched:  |  |  |
Use of uninitialized value $1 in concatenation (.) or string at .
Use of uninitialized value $2 in concatenation (.) or string at .
Use of uninitialized value $3 in concatenation (.) or string at .
Use of uninitialized value $4 in concatenation (.) or string at .
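
To tell a zero-repetition match apart from a real one, the match length and definedness of $1 can be checked. This is only a sketch; $length and $iterated are illustrative names.

```perl
use strict;
use warnings;

my $str = 'xzy';
my ($matched, $length, $iterated);

if ($str =~ m/((a+)?(b+)?(c))*/) {
    $matched  = 1;
    $length   = $+[0] - $-[0];          # length of the whole match
    $iterated = defined $1 ? 1 : 0;     # did the outer group match at least once?
    print "matched, length $length, outer group ",
        ($iterated ? "set" : "unset"), "\n";
}
```

For 'xzy' the pattern succeeds with an empty match at position 0, so the length is 0 and $1 is undefined.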

Even if this was changed to + for one-or-more matches, that really means you are usually looking for only the last match. In that case, you should rewrite your pattern to find only the last case and not pollute the per-match variables with previous matches. I don't see a good way to do that in this abstract, contextless situation, but maybe it's a global match in scalar context:

use strict;
use warnings;

my $str = 'aacbbbcac';

while ($str =~ m/(a+)?(b+)?(c)/g) {
    print "$& | $1 | $2 | $3\n";
}

The output does not carry the same baggage between matches as before because these are now separate successful matches:

aac | aa |  | c
bbbc |  | bbb | c
ac | a |  | c
Use of uninitialized value $2 in concatenation (.) or string at .
Use of uninitialized value $1 in concatenation (.) or string at .
Use of uninitialized value $2 in concatenation (.) or string at .

Note that you can't use \G here because the matches overlap.
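
If the per-match captures need to be kept, one option is to copy them into a data structure inside the loop. This is a sketch under the assumption of Perl 5.10 or later (for the // operator); @rows is an illustrative name.

```perl
use strict;
use warnings;

my $str = 'aacbbbcac';
my @rows;    # illustrative: one array ref per successful match

while ($str =~ m/(a+)?(b+)?(c)/g) {
    # Optional groups that did not take part in this match are undef,
    # so substitute '' before storing (// needs Perl 5.10 or later)
    push @rows, [ $1 // '', $2 // '', $3 ];
}

print join(' | ', @$_), "\n" for @rows;
```

Substituting an empty string for undefined captures also avoids the "uninitialized value" warnings when printing.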
