REGEX PCRE Recursive expression for nested text matching

Question

I am trying to match something like this (yes, including newlines):

Match #1

START
    START
        stuff
    STOP
    more stuff
STOP

Match #2

START
    START
        stuff
    STOP
    more stuff
STOP

This is how far I have come:

START(.*?^(?:(?!STOP).)*$|(?R))|STOP with the flags "g", "m", "i" and "s"

The problem is that I cannot match anything after the inner STOP without matching all the way to the last "STOP" in the entire text.

Here is a regex101 example

https://regex101.com/r/vD4nX6/1

I would appreciate some guidance.

Thanks in advance

If the problem is that you match through to the last STOP, you probably need to make all your quantifiers non-greedy, for example by changing '*$' to '*?$' or by using the capital 'U' flag.
– WhoIsRich, commented Jun 26, 2016 at 21:54

melpomene · Accepted Answer · 2016-06-26 22:11:03Z

Here's a pattern that matches your example:

^\h*START\h*\n(?:\h*+(?!(?:START|STOP)\h*$)[^\n]*\n|(?R)\n)*\h*STOP\h*$

using the /mg flags (live at https://regex101.com/r/iK9tK5/1).

The idea behind it:

^                                  # beginning of line
\h* START \h* \n                   # "START" optionally surrounded by horizontal whitespace
                                   #   on a line of its own
(?:                                # between START/STOP, every line is either "normal"
                                   #   or a recursive START/STOP block
    \h*+                           # a normal line starts with optional horizontal whitespace
    (?!                            #   ... not followed by ...
        (?: START | STOP ) \h* $   #   "START" or "STOP" on their own
    )
    [^\n]* \n                      # any characters, then a newline
|
    (?R) \n                        # otherwise it's a recursive START/STOP block
)*                                 # we can have as many items as we want between START/STOP
\h* STOP \h*                       # "STOP" optionally surrounded by horizontal whitespace
$                                  # end of line
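
To make the structure the pattern accepts more concrete, here is a minimal sketch in Python. It is not the PCRE pattern itself: Python's built-in re module has no (?R), so this swaps in a plain depth counter that walks the text line by line. The function name find_blocks is hypothetical, and only spaces and tabs are treated as horizontal whitespace (an assumption; PCRE's \h covers more characters).

```python
def find_blocks(text):
    """Return every top-level START...STOP block in text as a string.

    A stand-in for the recursive pattern: a line that is exactly START or
    STOP (ignoring surrounding spaces/tabs) changes the nesting depth,
    and every other line counts as a "normal" line.
    """
    blocks = []
    depth = 0            # how many START lines are currently open
    first_line = None    # index of the line that opened the outermost block
    lines = text.split("\n")
    for i, line in enumerate(lines):
        word = line.strip(" \t")
        if word == "START":
            if depth == 0:
                first_line = i
            depth += 1
        elif word == "STOP" and depth > 0:
            depth -= 1
            if depth == 0:
                # The outermost block just closed: keep all of its lines.
                blocks.append("\n".join(lines[first_line:i + 1]))
    return blocks


sample = (
    "START\n"
    "    START\n"
    "        stuff\n"
    "    STOP\n"
    "    more stuff\n"
    "STOP\n"
    "\n"
    "START\n"
    "    other stuff\n"
    "STOP"
)
blocks = find_blocks(sample)
```

Each entry in blocks corresponds to one top-level match of the pattern on this sample. Inputs with unbalanced START/STOP lines are handled only loosely here, so treat it as an illustration rather than a replacement for the regex.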

I've made \h*+ possessive so that the pattern can't accidentally accept a line like " STOP" as a normal line. With a plain \h*, the engine could backtrack to 0 iterations of \h*; at that position the remaining text is " STOP" (with a leading space), not "STOP", so the look-ahead would succeed. The + stops \h* from giving back anything it has matched, so it has to consume the space.

Alternatively you could pull \h* into the look-ahead: (?!\h*(?:START|STOP)\h*$)
That would also work, but then the look-ahead would skip over any spaces to see whether they're followed by START/STOP, only to have [^\n]* outside go over those same spaces again. With \h*+ at the start, we match those spaces once, with no backtracking. I guess it's a micro-optimization.
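
The backtracking problem described above can be reproduced in isolation. The sketch below uses Python's re module as a stand-in for PCRE, with [ \t] in place of \h (which re does not support). It compares only the "normal line" part of the pattern in two forms: plain greedy whitespace in front of the look-ahead, and whitespace pulled inside the look-ahead. The names NAIVE, LOOKAHEAD and is_normal_line are made up for this illustration.

```python
import re

# Greedy but NOT possessive whitespace before the look-ahead. The engine may
# backtrack and give a space back, which lets the look-ahead pass.
NAIVE = re.compile(r"^[ \t]*(?!(?:START|STOP)[ \t]*$)[^\n]*$", re.M)

# Whitespace moved inside the look-ahead (the alternative from the answer).
LOOKAHEAD = re.compile(r"^(?![ \t]*(?:START|STOP)[ \t]*$)[^\n]*$", re.M)


def is_normal_line(pattern, line):
    """True if the pattern accepts line as a non-START/STOP line."""
    return pattern.match(line) is not None


indented_stop = "    STOP"
naive_result = is_normal_line(NAIVE, indented_stop)          # True (wrong)
lookahead_result = is_normal_line(LOOKAHEAD, indented_stop)  # False (right)
```

The naive form wrongly treats the indented STOP as a normal line, which is exactly the case the possessive \h*+ rules out in the PCRE pattern.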
