Studying PCRE syntax documentation, I came to the "Backtracking Control" section:

The following act immediately they are reached:

  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)

The following act only when a subsequent match failure causes a backtrack to reach them. They all force a match failure, but they differ in what happens afterwards. Those that advance the start-of-match point do so only if the pattern is not anchored.

  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)

Frankly, I understood almost nothing. The only thing I have been able to understand so far is how the (*SKIP) token works, thanks to the question "How do (*SKIP) or (*F) work on regex?". Can anyone explain, preferably with examples, how (*COMMIT), (*PRUNE) and (*THEN) could be used in practice?

    These verbs behave like zero-width assertions: they are triggered only when the engine reaches them while backtracking. COMMIT makes the engine stop looking for matches altogether, PRUNE makes it fail at the current starting position and resume the search from the next one, and THEN just "prioritizes" the subsequent alternation pattern (it is meant to be used in alternation groups). Ideally you should never need THEN if you always place the specific alternation patterns at the beginning of the group and the generic ones at the end. Commented Sep 30 at 13:42
    I guess there are a lot of experts who will tell you they know how these verbs work. But the reality is that in Perl/PCRE-style regex these verbs don't behave consistently throughout the regex. Except for SKIP/FAIL, I would steer clear of verbs. Also note that (*ACCEPT) is not really a backtracking verb; it was just put there because there was no other special category. For an example of (*ACCEPT) usage and how tricky it is, see stackoverflow.com/questions/21994677/… Commented Oct 2 at 23:28

1 Answer


(*PRUNE), (*SKIP), (*COMMIT) and (*THEN) all work the same way in the following sense:

  • they are triggered if the subpattern after them fails.
  • they are ignored when the subpattern after them succeeds.
  • they forbid backtracking into the subpattern before them (but not within the subpattern after them).

The difference is only what happens after.


With (*PRUNE) the pattern is tested again but at the next position in the string (if the whole pattern was tested at the position n, then the next try is tested at the position n+1 in the subject string). Note that it isn't different from the normal behavior when a pattern fails, except that it avoids the backtracking steps in the subpattern before it. In other words, this backtracking control verb is useful to fail faster.

Consider the two patterns (in free-spacing mode) with the subject string aaa bbb:

With the first one, the first branch is tried and, since there is no letter c in the string, it obviously fails (at the end of the string). But the second branch isn't tested immediately: the first branch has to test all possibilities via the backtracking mechanism (which doesn't change the result here) to be sure there's no letter c before the end of the string. Only after that is the second branch tried (and it fails too). Then the whole pattern is tried at the next position in the subject string, and this game continues until the pattern is tested at the position of the first letter b. Then the second branch succeeds. Tedious, isn't it?

With the second pattern, same scenario, except that all the backtracking steps in the first branch are avoided (when this one fails, the pattern is immediately tested at the next position in the subject string) and the second branch is tested only when the letter a isn't found (the first branch fails but this time before the verb and this allows the second branch to be tested).


With (*SKIP) the pattern is tested again, but starting at the position reached by the subpattern before the verb. This is useful when you want to skip useless (or problematic) positions in the string and advance faster.
(*SKIP:name) does the same thing, except that the next try starts at the position of the marker "name", set with (*MARK:name), instead of the position of (*SKIP:name). The marker has to be already known to the regex engine and has to stand before the (*SKIP:name) verb in the pattern.
Note that (*SKIP:name) is the only verb that has a particular relation with a marker. All other verbs with :name are only shortcuts, more or less useful, for (*MARK:name)(*VERB).


With (*COMMIT) the pattern isn't tested again at all. Consider these two patterns (in free-spacing mode) with the subject string aaa bbb:

With the first pattern, the result is the whole string aaa bbb. The first branch fails, the second branch is tested and succeeds.

With the second pattern, there's no match at all because (*COMMIT) is encountered in the first branch, which then fails (after the verb), so the search is abandoned for good. The second branch is never tested.


(*THEN) is a little different in the sense that it is useful only in an alternation.

Consider these two patterns (in free-spacing mode) with the subject string aaa bbb:

  • ^ a (?: .* \b b | .* b \B )

  • ^ a (?: .* (*THEN) \b b | .* b \B )

With the first pattern, the result is aaa b.
What happens: the first branch .* \b b is tried: .* reaches the end of the string because the quantifier is greedy (it covers aa bbb), the word boundary \b succeeds (between the last b and the end of the string), but there's no more b to match (only the end of the string).

The backtracking mechanism starts and .* will give back characters one by one until the subpattern \b b succeeds:

  • First backtracking step: .* gives back the last b and covers aa bb, but the word boundary \b fails (between the second and third b).
  • Second backtracking step: .* gives back the second b and covers aa b, the word boundary fails (between the first and second b).
  • Third backtracking step: .* gives back the first b, the word boundary succeeds (between the space and the first b) and the literal b too. The pattern succeeds. Note that the second branch of the alternation is never tested.
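The steps above can be replayed with a short script. This is only a sketch: it uses Python's standard `re` module, which handles the verb-free first pattern but does not implement backtracking control verbs. The names `subject`, `tail` and `trace` are made up for the illustration, and the loop imitates by hand what the greedy `.*` does rather than inspecting the engine itself.

```python
import re

subject = "aaa bbb"

# The first pattern from the answer, in free-spacing (verbose) mode.
plain = re.compile(r"^ a (?: .* \b b | .* b \B )", re.VERBOSE)
result = plain.search(subject).group()

# Replay the backtracking of the greedy .* by hand. It starts right after
# the leading "a" (index 1) and first runs to the end of the subject, then
# gives back one character at a time until "\b b" matches at its end.
tail = re.compile(r"\bb")
trace = []
for end in range(len(subject), 0, -1):
    ok = tail.match(subject, end) is not None
    trace.append((subject[1:end], ok))
    if ok:
        break

for covered, ok in trace:
    print(repr(covered), "->", "succeeds" if ok else "fails")
print("match:", repr(result))
```

The trace has four entries: the initial attempt with `.*` covering `aa bbb`, then the three backtracking steps listed above, the last of which succeeds.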

With the second pattern, the result is aaa bb.
What happens: when the first branch .* (*THEN) \b b is tried and fails at the end of the string, (*THEN) prevents backtracking (locally, i.e. only for this branch of the alternation in this group). The first branch is abandoned, and the second one is tested. Note that this one succeeds even though it needs one backtracking step (for the \B).
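The effect can be approximated without the verb. Python's standard `re` module does not implement (*THEN), so the sketch below does not run the second pattern itself. Assuming the behavior described above (the first branch is abandoned as soon as it fails), it runs the second branch on its own with the shared `^ a` prefix and compares the result with the full verb-free alternation. The variable names are illustrative only.

```python
import re

subject = "aaa bbb"

# Full alternation without any verb: the first branch wins after backtracking.
without_then = re.compile(r"^ a (?: .* \b b | .* b \B )", re.VERBOSE)

# What remains once the first branch is abandoned: the second branch alone.
second_branch_only = re.compile(r"^ a .* b \B", re.VERBOSE)

print(repr(without_then.search(subject).group()))
print(repr(second_branch_only.search(subject).group()))
```

With a PCRE-based engine the (*THEN) pattern can be tried directly; the script above only checks that the second branch by itself produces the result reported for it.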

Note that (*THEN) acts only locally and not on the whole pattern. In this example, its action stays confined to the non-capturing group (the innermost group that contains the (*THEN) verb). Obviously this isn't the case for a pattern like ^ a .* (*THEN) \b b | ^ a .* b \B, where local and global are the same.


About (*MARK:name) or the shorter syntax (*:name): it's used to name a position reached by the pattern in the string. If a pattern succeeds via a path that passes a marker, the name of that marker is stored in the "object" of the match result. (The nature of this "object" depends on the implementation.)

  • It is useful for debugging purposes, for example to know which branch of a pattern has been used in a successful match.
  • It can also be used in conjunction with (*SKIP) to define the position where to retry the pattern (this one may be different from the position where (*SKIP) is encountered):
    bla bla (*MARK:RetryHere) bla bla (*SKIP:RetryHere) blu
    This pattern is retried and succeeds on the second (and last) "blabla" followed by "blu" in the subject string blablablablablu.
  • (*THEN:name) is only a short way to write (*MARK:name)(*THEN) (not very useful): if the subpattern fails, the next branch is tried (as explained before); if the subpattern succeeds, the marker is stored somewhere in the result object (whatever it looks like, depending on the implementation).
  • Same thing for (*PRUNE:name), it's only a shortcut for (*MARK:name)(*PRUNE).

1 Comment

Really appreciate the effort you put into making the answer so detailed and complete (yes, I looked at the edit history) - you definitely succeeded. Thanks!
