I am trying to match blocks like these (yes, including the newlines):

Match #1

START
    START
        stuff
    STOP
    more stuff
STOP

Match #2

START
    START
        stuff
    STOP
    more stuff
STOP

This is how far I have come:

START(.*?^(?:(?!STOP).)*$|(?R))|STOP with the flags "g", "m", "i" and "s"

The problem is that I cannot match anything after the STOP without matching the last "STOP" in the entire text.

Here is a regex101 example

https://regex101.com/r/vD4nX6/1

I would appreciate some guidance.

Thanks in advance

  • If the problem is matching the last STOP, you probably need to make all your matches non-greedy, for example by changing '*$' to '*?$' or by using the capital 'U' flag. Commented Jun 26, 2016 at 21:54
  • How about START(?>(?!ST[AO]).|(?0))*STOP Commented Jun 27, 2016 at 13:43

1 Answer

Here's a pattern that matches your example:

^\h*START\h*\n(?:\h*+(?!(?:START|STOP)\h*$)[^\n]*\n|(?R)\n)*\h*STOP\h*$

using the /mg flags (live at https://regex101.com/r/iK9tK5/1).

The idea behind it:

^                                  # beginning of line
\h* START \h* \n                   # "START" optionally surrounded by horizontal whitespace
                                   #   on a line of its own
(?:                                # between START/STOP, every line is either "normal"
                                   #   or a recursive START/STOP block
    \h*+                           # a normal line starts with optional horizontal whitespace
    (?!                            #   ... not followed by ...
        (?: START | STOP ) \h* $   #   "START" or "STOP" on their own
    )
    [^\n]* \n                      # any characters, then a newline
|
    (?R) \n                        # otherwise it's a recursive START/STOP block
)*                                 # we can have as many items as we want between START/STOP
\h* STOP \h*                       # "STOP" optionally surrounded by horizontal whitespace
$                                  # end of line
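
For readers who want to check what the pattern should return without a PCRE engine, here is a minimal Python sketch of the same line grammar. It is not the regex itself: it swaps the recursion for a plain depth counter, and the function name `top_level_blocks` is made up for illustration. Each line is treated as "START", "STOP", or a normal line, just as in the breakdown above.

```python
def top_level_blocks(text):
    """Return every outermost START...STOP block as one string.

    Assumption: blocks are balanced. Unlike the regex, an unclosed
    block is silently dropped rather than retried from a later line.
    """
    blocks = []
    current = []  # lines of the block currently being collected
    depth = 0     # how many START lines are still open

    for line in text.splitlines():
        word = line.strip(" \t")  # ignore horizontal whitespace only
        if word == "START":
            depth += 1
            current.append(line)
        elif word == "STOP" and depth > 0:
            depth -= 1
            current.append(line)
            if depth == 0:  # the outermost block just closed
                blocks.append("\n".join(current))
                current = []
        elif depth > 0:
            current.append(line)  # a normal line inside a block
        # lines outside any block are skipped

    return blocks


sample = (
    "Match #1\n\n"
    "START\n    START\n        stuff\n    STOP\n    more stuff\nSTOP\n\n"
    "Match #2\n\n"
    "START\n    START\n        stuff\n    STOP\n    more stuff\nSTOP\n"
)

for block in top_level_blocks(sample):
    print(block)
    print("---")
```

On the sample from the question this yields two blocks, each containing its nested START/STOP pair, which is what the recursive pattern is meant to match.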

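I've made \h*+ possessive so that an indented " STOP" line can't accidentally be accepted as a normal line. With a plain \h*, the engine could backtrack to fewer iterations of \h* (for example 0), where the look-ahead no longer sees "STOP" but " STOP" (with a leading space) and therefore succeeds. The + forces \h to match as many times as it possibly can and never give anything back, so it has to consume the space and the look-ahead is always tested right at "STOP".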

Alternatively you could pull \h* into the look-ahead: (?!\h*(?:START|STOP)\h*$)
That would also work, but then the look-ahead would skip over any spaces to see whether they're followed by START/STOP, only to have [^\n]* outside go over those same spaces again. With \h*+ at the start, we match those spaces once, with no backtracking. I guess it's a micro-optimization.
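
The backtracking issue can be reproduced in isolation. The sketch below tests only the "normal line" part of the pattern, in Python, under a few assumptions: `[ \t]` stands in for \h (which Python's `re` does not have), the possessive \h*+ is emulated with a capturing look-ahead plus a backreference so that it does not depend on engine support for `*+`, and the helper name `is_normal_line` is made up for illustration.

```python
import re

H = r"[ \t]"  # stand-in for PCRE's \h

# Plain greedy whitespace: the engine is allowed to give spaces back.
greedy = re.compile(rf"^{H}*(?!(?:START|STOP){H}*$)[^\n]*$")

# Possessive behaviour, emulated: the look-ahead captures all leading
# whitespace and \1 then consumes exactly that, with nothing to give back.
possessive = re.compile(rf"^(?=({H}*))\1(?!(?:START|STOP){H}*$)[^\n]*$")

# The alternative: whitespace pulled into the negative look-ahead.
lookahead = re.compile(rf"^(?!{H}*(?:START|STOP){H}*$)[^\n]*$")


def is_normal_line(pattern, line):
    """True if `pattern` accepts `line` as a non-START/STOP line."""
    return pattern.match(line) is not None


for name, pattern in [("greedy", greedy),
                      ("possessive", possessive),
                      ("lookahead", lookahead)]:
    print(name,
          is_normal_line(pattern, "    STOP"),
          is_normal_line(pattern, "    more stuff"))
```

Only the greedy variant accepts the indented "    STOP" as a normal line; the other two reject it, and all three accept "    more stuff".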


1 Comment

Thank you for the idea behind it, this really helped me!
