1

I want here to submit a very specific performance problem that i want to understand.

Goal

I'm trying to validate a custom synthax with a regex. Usually, i'm not encountering performance issues, so i like to use it.

Case

The regex:

^(\{[^\][{}(),]+\}\s*(\[\s*(\[([^\][{}(),]+\s*(\(\s*([^\][{}(),]+\,?\s*)+\))?\,?\s*)+\]\s*){1,2}\]\s*)*)+$

A valid synthax:

{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]

You could find the regex and a test text here : https://regexr.com/3jama

I hope that be sufficient enough, i don't know how to explain what i want to match more than with a regex ;-).

Issue

Applying the regex on valid text is not costing much, it's almost instant. But when it comes to specific not valid text case, the regexr app hangs. It's not specific to regexr app since i also encountered dramatic performances with my own java code or javascript code.

Thus, my needs is to validate all along the user is typing the text. I can even imagine validating the text on click, but i cannot afford that the app will be hanging if the text submited by the user is structured as the case below, or another that produce the same performance drop.

Reproducing the issue

Just remove the trailing "]" character from the test text

So the invalid text to raise the performance drop becomes:

{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4

Another invalid test could be, and with no permformance drop:

{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]]

Request

I'll be glad if a regex guru coming by could explain me what i'm doing wrong, or why my use case isn't adapted for regex.

14
  • 1
    you need to provide a minimal reproducible example here. Not in a link Commented Jan 17, 2018 at 9:01
  • [\s\t\n\r]* is completely useless - \s* does the same and is more readable. Additionally, the syntax for the square brackets could be [^][{}(),]+ (no need to escape everything). Commented Jan 17, 2018 at 9:16
  • a little explaination of the regex would still be welcome, it is maybe a bit overcomplexified; you could replace all the [\s\t\n\r]* by \s* for example Commented Jan 17, 2018 at 9:16
  • 2
    @rtribotte in the case of a failed match, backtracking starts to kick off. Then constructs like [^\][{}(),]+\,?\s* where characters can be occopied by the character class as well as the \s open the gates to catastrophic backtracking, especially if they appear in repeated groups with no other boundaries around. Commented Jan 17, 2018 at 9:54
  • 1
    @Jan OP mentions JavaScript in the title - in that case you have to escape it. [^] is valid in JS regex. Commented Jan 17, 2018 at 10:03

1 Answer 1

3

This answer is for the condensed regex from your comment:

^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+\,?)+\))?\,?)+\]){1,2}\])*)+$

The issues are similar for your original pattern.

You are facing catastrophic backtracking. Whenever the regex engine cannot complete a match, it backtracks into the string, trying to find other ways to match the pattern to certain substrings. If you have lots of ambiguous patterns, especially if they occur inside repetitions, testing all possible variations takes a looooong time. See link for a better explanation.

One of the subpatterns that you use is the following (multilined for better visualisation):

([^\][{}(),]+
  (\(
    ([^\][{}(),]+\,?)+
  \))?
\,?)+

That is supposed to match a string like actor4(syno3, syno4). Condensing this pattern a little more, you get to ([^\][{}(),]+,?)+. If you remove the ,? from it, you get ([^\][{}(),]+)+ which is an opening gate to the catasrophic backtracking, as string can be matched in quite a lot of different ways with this pattern.

I get what you try to do with this pattern - match an identifier - and maybe other other identifiers that are separated by comma. The proper way of doing this however is: ([^\][{}(),]+(?:,[^\][{}(),]+)*). Now there isn't an ambiguous way left to backtrack into this pattern.

Doing this for the whole pattern shown above (yes, there is another optional comma that has to be rolled out) and inserting it back to your complete pattern I get to:

^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+)*)\))?(?:\,[^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+))*\))?)*)\]){1,2}\])*)+$

Which doesn't catastrophically backtrack anymore.

You might want to do yourself a favour and split this into subpatterns that you concat together either using strings in your actual source or using defines if you are using a PCRE pattern.

Note that some regex engines allow the use of atomic groups and possessive quantifiers that further help avoiding needless backtracking. As you have used different languages in your title, you will have to check yourself, which one is available for your language of choice.

Sign up to request clarification or add additional context in comments.

1 Comment

First, i want to thank you for your time, it took half an hour until i catch the trick for my case, you've explained it very well, and regular-expressions.info dedicated page as well. I have to go further to understand the use of atomic groups and possessive quantifiers, and see what i could do in java. Many thanks, you put me in the right way, to not say that you did the job for me, and you put a word on the symptoms : Catastrophic Backtracking

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.