2

Please see my regular expression pattern code:

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import re

print 'Start'
str1 = 'abcdefgasdsdfswossdfasdaef'
m = re.match(r"([A-Za-z\-\s\:\.]+)+(\d+)\w+", str1) # Want to match something like 'Moto 360x'
print m # None is expected.
print 'Done'

It takes 49 seconds to finish, any problem with the pattern?

3
  • ([A-Za-z\-\s\:\.]+)+ --> [A-Za-z\-\s\:\.]+ Commented Dec 12, 2014 at 16:46
  • 2
    because there's a zillion different ways your regex can match the string, and the engine is backtracking and trying each of the variations. Commented Dec 12, 2014 at 16:46
  • More information on what and why a regex is backtracking and how catastrophic it becomes when you don't match: regular-expressions.info/catastrophic.html Commented Dec 12, 2014 at 16:49

2 Answers 2

7

See Runaway Regular Expressions: Catastrophic Backtracking.

In brief, if there are extremely many combinations a substring can be split into the parts of the regex, the regex matcher may end up trying them all.

Constructs like (x+)+ and x+x+ practically guarantee this behaviour.

To detect and fix the problematic constructs, the following concept can be used:

  • At conceptual level, the presence of a problematic construct means that your regex is ambiguous - i.e. if you disregard greedy/lazy behaviour, there's no single "correct" split of some text into the parts of the regex (or, equivalently, a subexpression thereof). So, to avoid/fix the problems, you need to see and eliminate all ambiguities.

    • One way to do this is to

      • always split the text into its meaningful parts (=parts that have separate meanings for the task at hand), and
      • define the parts in such a way that they cannot be confused (=using the same characteristics that you yourself would use to tell which is which if you were parsing it by hand)
Sign up to request clarification or add additional context in comments.

4 Comments

Is something of the form (x+y)+ equally dangerous, or is that less likely to be a problem?
@jpmc26 It isn't unless something matches both x and y or y can be a blank match between x's.
It's also worth noting that this is not an inherent problem with regular expressions, but an issue due to a design decision in many common regex libraries. The RE2 engine does not have this issue.
@thatotherguy What you gave as example is the other approach to matching: text-directed. It does avoid the problem but it cannot use some features like lazy quantifiers or backreferences.
0

Just repost the answer and solution in comments from nhahtdh and Marc B:

([A-Za-z\-\s\:\.]+)+ --> [A-Za-z\-\s\:\.]+

Thanks so much to nhahtdh and Marc B!

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.