3

I'm trying to learn about Recursion in Regular Expressions, and have a basic understanding of the concepts in the PCRE flavour. I want to break a string:

Geese (Flock) Dogs (Pack) 

into:

Full Match: Geese (Flock) Dogs (Pack) 
Group 1: Geese (Flock)
Group 2: Geese
Group 3: (Flock)
Group 4: Dogs (Pack)
Group 5: Dogs
Group 6: (Pack)

I know neither regex quite does this, but I was more curious as to the reason why the the first pattern works, but the second one doesn't.

Pattern 1: ((.*?)(\(\w{1,}\)))((.*?)(\g<3>))*
Pattern 2: ((.*?)(\(\w{1,}\)))((\g<2>)(\g<3>))*

Also, if for example you're dealing with a long string, and a pattern repeats itself, is it possible to continually expand the full match, and incrementally increase the groups without writing a loop statement separate to the regex.

Full Match: Geese (Flock) Dogs (Pack) Elephants (Herd) 
Group 1: Geese (Flock)
Group 2: Geese
Group 3: (Flock)
Group 4: Dogs (Pack)
Group 5: Dogs
Group 6: (Pack)
Group 7: Elephants (Herd)
Group 8: Elephants 
Group 9: (Herd)

This is the closest I've came to was this pattern, but the middle group: Dogs (Pack) becomes Group 0.

((.*?)(\(\w{1,}\)))((.*?)(\g<3>))*
7
  • Regarding Q2, repeated capturing groups only keep the last matched occurrence, it is a common question on SO. Q1 is easy to answer but will take time to explain: recursion levels are atomic in PCRE. Commented Aug 17, 2018 at 18:02
  • Also, see this thread, it might even be a dupe reason. Commented Aug 17, 2018 at 18:07
  • Thanks will look through it. Commented Aug 17, 2018 at 18:11
  • Now, if you have read my answer at that thread, is it all clear? Or do you need any clarifications? Commented Aug 17, 2018 at 20:37
  • Regarding the first part, I've gone through the link, and your SO answer, and it still just isn't clicking. I understand with atomicity, that once it finds the match, it doesn't re-enter it. Is what I'm doing wrong here actually related to the use of the first group, is it do with the positioning of the group, or is it a combination of the two? Commented Aug 18, 2018 at 19:53

1 Answer 1

1

Mind that recursion levels in PCRE are atomic. Once these patterns find a match they are never re-tried.

See Recursion and Subroutine Calls May or May Not Be Atomic:

Perl and Ruby backtrack into recursion if the remainder of the regex after the recursion fails. They try all permutations of the recursion as needed to allow the remainder of the regex to match. PCRE treats recursion as atomic. PCRE backtracks normally during the recursion, but once the recursion has matched, it does not try any further permutations of the recursion, even when the remainder of the regex fails to match. The result is that Perl and Ruby may find regex matches that PCRE cannot find, or that Perl and Ruby may find different regex matches.

Your second pattern, at the first recursion level, will look like

((.*?)(\(\w{1,}\)))(((?>.*?))((?>\(\w{1,}\))))*
                     ^^^^^^^  ^^^^^^^^^^^^^^

See demo. That is, \g<2> is then (?>.*?), not .*?. That means that, after the ((.*?)(\(\w{1,}\))) pattern matched Geese (Flock), the regex engine tries to match with (?>.*?), sees it is a lazy pattern that does not have to consume any chars, skips it (and will never come back to this pattern), and tries to match with (?>\(\w{1,}\)). As there is no ( after ), the regex returns what it consumed.

As for the second question, it is a common problem. It is not possible to get an arbitrary number of captures with a PCRE regex, as in case of repeated captures only the last captured value is stored in the group buffer. You cannot have more submatches in the resulting array than the number of capturing groups inside the regex pattern. See Repeating a Capturing Group vs. Capturing a Repeated Group for more details.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.