Why does my Python regular expression pattern run so slowly?

Question

Please see my regular expression pattern code:

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import re

print('Start')
str1 = 'abcdefgasdsdfswossdfasdaef'
m = re.match(r"([A-Za-z\-\s\:\.]+)+(\d+)\w+", str1) # Want to match something like 'Moto 360x'
print(m) # None is expected.
print('Done')

It takes 49 seconds to finish. Is there a problem with the pattern?

Because there are a zillion different ways your regex can match the string, and the engine is backtracking and trying each of the variations.
– Marc B, Commented Dec 12, 2014 at 16:46
More information on what regex backtracking is, why it happens, and how catastrophic it becomes when there is no match: regular-expressions.info/catastrophic.html
– Benoît Latinier, Commented Dec 12, 2014 at 16:49

ivan_pozdeev · Accepted Answer · 2015-06-22 20:24:48Z

7

See Runaway Regular Expressions: Catastrophic Backtracking.

In brief, if there are extremely many ways to divide a substring among the parts of the regex, the regex matcher may end up trying them all before it can report a failed match.

Constructs like (x+)+ and x+x+ practically guarantee this behaviour.
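To get a feel for the numbers, here is a rough model rather than a trace of the actual engine: it counts how many ways a run of n matching characters can be divided among the repetitions of the inner x+ in (x+)+. The function name count_splits is made up for this illustration, and Python 3 is assumed.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_splits(n):
    """Count the ways to cut a run of n characters into one or more non-empty pieces."""
    if n == 0:
        return 1  # nothing left to cut: exactly one way
    # Pick the length of the first piece, then cut up whatever remains.
    return sum(count_splits(n - first) for first in range(1, n + 1))

for n in (5, 10, 26):
    print(n, count_splits(n))
```

The count doubles with every extra character (2**(n-1) in total), so the 26-letter string from the question already allows 33,554,432 splits, all of which can be tried when the rest of the pattern fails to match.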

To detect and fix the problematic constructs, the following concept can be used:

At a conceptual level, the presence of a problematic construct means that your regex is ambiguous, i.e. if you disregard greedy/lazy behaviour, there is no single "correct" way to split some text among the parts of the regex (or, equivalently, a subexpression thereof). So, to avoid or fix the problems, you need to find and eliminate all ambiguities.
- One way to do this is to
  - always split the text into its meaningful parts (=parts that have separate meanings for the task at hand), and
  - define the parts in such a way that they cannot be confused (=using the same characteristics that you yourself would use to tell which is which if you were parsing it by hand)
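A minimal sketch of the two points above applied to the 'Moto 360x' example from the question, assuming Python 3. The three parts (a name without digits, a run of digits, a suffix that starts with a letter or underscore), the group names, and the helper parse_model are assumptions made for illustration; the question does not spell out the exact format.

```python
import re

# Each part uses characters that the next part cannot start with,
# so there is only one way to divide the text among them.
MODEL = re.compile(
    r"(?P<name>[A-Za-z\-\s:.]+)"   # letters and separators, never digits
    r"(?P<number>\d+)"             # digits only
    r"(?P<suffix>[A-Za-z_]\w*)"    # starts with a non-digit word character
)

def parse_model(text):
    """Return (name, number, suffix), or None if text does not start with a model."""
    m = MODEL.match(text)
    return m.groups() if m else None

print(parse_model("Moto 360x"))                   # ('Moto ', '360', 'x')
print(parse_model("abcdefgasdsdfswossdfasdaef"))  # None
```

Because the boundaries between the parts are unique, a failed match only has to shrink the name one character at a time instead of exploring every possible regrouping.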

edited Jun 22, 2015 at 20:24

answered Dec 12, 2014 at 16:54

ivan_pozdeev


4 Comments

jpmc26 Over a year ago

Is something of the form (x+y)+ equally dangerous, or is that less likely to be a problem?

ivan_pozdeev Over a year ago

@jpmc26 It isn't, unless something can match both x and y, or y can match the empty string between x's.
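A small sketch of the safe case, assuming Python 3 and using a+ for x and a comma for y (both chosen only for illustration). Since nothing matches both parts, each run of a's can belong to only one repetition of the group.

```python
import re

# Every repetition of the group has to end with a comma, so the way the
# text is divided among repetitions is unique.
separated = re.compile(r"(a+,)+$")

print(bool(separated.match("aa,a,aaa,")))  # True

# No comma anywhere: a+ can only give back one character at a time,
# so the failure is found without trying regroupings.
print(separated.match("a" * 5000 + "!"))   # None
```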

that other guy Over a year ago

It's also worth noting that this is not an inherent problem with regular expressions, but an issue due to a design decision in many common regex libraries. The RE2 engine does not have this issue.

ivan_pozdeev Over a year ago

@thatotherguy What you gave as an example is the other approach to matching: text-directed. It does avoid the problem, but it cannot use some features like lazy quantifiers or backreferences.

Reed_Xia · Accepted Answer · 2014-12-12 16:51:32Z

0

Reposting the solution from the comments by nhahtdh and Marc B:

([A-Za-z\-\s\:\.]+)+ --> [A-Za-z\-\s\:\.]+
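A quick sanity check of this change, assuming Python 3. The sample strings and the helper matched_text are made up for illustration; they are kept short so that the original nested pattern still finishes instantly.

```python
import re

nested = re.compile(r"([A-Za-z\-\s\:\.]+)+(\d+)\w+")  # pattern from the question
flat = re.compile(r"[A-Za-z\-\s\:\.]+(\d+)\w+")       # pattern after the change

def matched_text(pattern, text):
    """Return the text matched at the start of `text`, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None

samples = ["Moto 360x", "X-1a", "a1", "12ab", "abc", "Moto 360x!!"]
results = {s: (matched_text(nested, s), matched_text(flat, s)) for s in samples}
for s, (before, after) in results.items():
    print(repr(s), before, after)

# The input from the question is now rejected without the long wait.
print(flat.match("abcdefgasdsdfswossdfasdaef"))  # None
```

On these samples both patterns accept and reject the same strings. One side effect to keep in mind: with the outer group gone, the digits are captured as group 1 instead of group 2.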

Thanks so much to nhahtdh and Marc B!

answered Dec 12, 2014 at 16:51

Reed_Xia
