1

I need a regex that checks is it a quoted (by ') string with possible escaped \' inside. So, I come up with the following regex, \'(\\.|[^\'])*\'.

"""\'(\\.|[^\'])*\'""".r.findFirstIn(s"'${"a"*100}'")

which works perfectly on small strings, but fails with stack overflow on size > 3000 bytes.

"""\'(\\.|[^\'])*\'""".r.findFirstIn(s"'${"a"*5000}'")

This is Scala snippets. Internally it runs java.util.regex, so it's java/jvm problem.

In my knowledge, those simple regex should not cause stack overflow, it's a simple DFA/NFA without any recursion inside.

How to workaround this issue?

I need regex for that (this is part of parser-combinator code, I can not just write custom code that checks the property).

Why there is recursion inside?

4
  • 1
    Does this contain an answer to your question? ... Or maybe one should answer this one separately, escaping with backslashes seems more common than triple-double-quotes. Commented Dec 3, 2018 at 22:35
  • More specifically: does """\'(\\.|[^\'])*+\'""".r.findFirstIn(s"'${"a"*5000}'") solve your problem? Commented Dec 3, 2018 at 22:43
  • thanks. It works, going to dig into links. A bit more sophisticated than I thought Commented Dec 3, 2018 at 22:59
  • If that works you can wrap the whole thing inside an atomic group \'(?>\\.|[^\'])*\' The issue is going to be the alternations that cause the stack problem If you use the unrolled-loop version it gets rid of the alternation. Commented Dec 3, 2018 at 23:01

2 Answers 2

2

You can try the classical Unrolling the Loop technique outlined by J. Friedl:

'                              # the start delimiter
 ([^\\']*                      # anything but the end of the string or the escape char
         (?:\\.                #     the escape char preceding an escaped char (any char)
               [^\\']*         #     anything but the end of the string or the escape char
                      )*)      #     repeat
                             ' # the end delimiter

Regex101 Demo

Sign up to request clarification or add additional context in comments.

1 Comment

As Scala code: """'([^\\']*(?:\\.[^\\']*)*)'""".r.findFirstIn(s"'${"a"*5000}'").
0

It might be related to RegEx DOS.

Java uses the traditional NFA algorithm [1] to support features such as lazy, backtracking and backreference. NFA 'eats in' a character each time and tries to match it with regexp, and 'spits' it out if it does not match. It will keep spitting until it can find another match (similar to deep first search), and thus bad expressions might cause the RegEx engine to encounter a RegEx DOS, and specifically in Java, it will finally cause a stack overflow for long strings.

According to OWASP, evil regexp expressions contain: Evil Regex pattern contains:

  • Grouping with repetition (1)
  • Inside the repeated group:
    • Repetition
    • Alternation with overlapping (2)

After a brief examination of your regexp expression, it seems that you have (1) and (2) since you have ()* (repetition) and \\.|[^\'] (overlapping), thus I believe you may have to restructure your RegEx expression to avoid RegEx DOS.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.