2

I have next 2 blocks of code:

import re
import time

def replace_re(text):
    # Strip newlines and 4-character whitespace runs using a regular expression.
    start = time.time()
    new_text = re.compile(r'(\n|\s{4})').sub('', text)
    finish = time.time()
    return finish - start

def replace_builtin(text):
    # Strip newlines and 4-space runs using str.replace().
    start = time.time()
    new_text = text.replace('\n', '').replace('    ', '')
    finish = time.time()
    return finish - start

Then I call both functions with a text parameter (~500 KB of source code from one web page). I thought replace_re() would be much faster, but the results are:

  1. replace_builtin() ~ 0.008 sec
  2. replace_re() ~ 0.035 sec (nearly 4.5 times slower!!!)

Why is that?
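
For reference, a minimal driver that reproduces this call pattern might look like the following; page.html is just a placeholder for wherever the saved page source lives, and the two functions above are assumed to be defined in the same script:

# Load the saved page source (the path is only a placeholder).
with open('page.html') as f:
    text = f.read()

print('replace_builtin: %.3f sec' % replace_builtin(text))
print('replace_re:      %.3f sec' % replace_re(text))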

2 Comments

  • Why wouldn't the regex be slower? A regular expression engine is much more expensive than iterating over a string and checking for matches.
  • +1 for profiling, and for showing your code and results. It's a legitimate question, even if many responses are treating it as common knowledge.

4 Answers

6

Because regular expressions are more than 4.5 times more complex than a fixed string replacement.


3 Comments

This may be true, but you didn't provide any supporting information for the statement. That's not really very useful, and doesn't answer the OP's question in a meaningful way.
I don't see how this is helpful. @VitaliPonomar, why is this a nice answer? Simply because it's almost common sense?
@zinking I meant it as a funny answer. As you can see, I've accepted another answer as the most helpful...
4

Because a regular expression has to be compiled into a finite-state machine (FSM), which is then used to process the string, while replace() can use the underlying string-processing functions that sit closer to the library/OS level.
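
As a rough illustration (timings are machine-dependent, and the synthetic input below is only a stand-in for a real page), the sketch below pays the compilation cost once, outside the timed calls; even then, the compiled pattern's sub() tends to lag behind str.replace(), which is the per-character engine cost described above:

import re
import timeit

# Synthetic stand-in for a large page: lines of text followed by 4-space indents.
text = ('x' * 70 + '\n' + '    ') * 5000

pattern = re.compile(r'(\n|\s{4})')   # compiled once, reused for every timed call

print(timeit.timeit(lambda: pattern.sub('', text), number=10))
print(timeit.timeit(lambda: text.replace('\n', '').replace('    ', ''), number=10))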

Comments

3

Rule of Thumb

Fixed-string matches should always be faster than regular expression matches. You can Google for various benchmarks, or do what you did and perform your own, but in general you can just assume that this will be true except (possibly) in some unusual edge cases.
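
A quick way to check the rule of thumb yourself, using a different operation (searching rather than replacing) so it is not tied to the question's exact code; the haystack and needle values here are arbitrary and the timings will vary:

import re
import timeit

haystack = 'a' * 10**6 + 'needle'   # arbitrary ~1 MB input with the match at the end

# Fixed-string search vs. the equivalent regular-expression search on the same input.
print(timeit.timeit(lambda: 'needle' in haystack, number=100))
print(timeit.timeit(lambda: re.search('needle', haystack), number=100))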

Why Regular Expressions are Slower

This is true because fixed-string matching involves no backtracking, compilation step, ranges, character classes, or the host of other features that slow down a regular expression engine. There are certainly ways to optimize regex matches, but it's unlikely to beat simply indexing into a string in the common case.

Use the Source

If you want a better explanation, you could always look at the source code for the relevant modules to see how they do what they do. That would certainly give you more information about why any particular implementation performs as it does.

Comments

3

Consider what happens in your source string for every single space character -- the regular expression engine must explore the next three characters to see if they are also spaces. So every space essentially involves looking ahead at the following characters as well. Most RE engines aren't written quite so ... bluntly ... but it does essentially mean that the engine must return to the start state and potentially throw away the partial-match state it had built up.

However, searching a string for a fixed pattern (four spaces) can actually be done in sub-linear time (e.g. with Boyer-Moore). I don't know whether Python's replace() is implemented in this fashion or not, but one hopes they've used the tools at their disposal well.
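
For the curious, here is a bare-bones sketch of the Boyer-Moore-Horspool idea (a simplified Boyer-Moore variant), showing how a fixed pattern lets the search skip ahead instead of examining every position. CPython's str.find()/str.replace() machinery is its own optimized C implementation, so treat this purely as an illustration of the principle rather than the code that actually runs:

def horspool_find(text, pattern):
    # Boyer-Moore-Horspool: precompute, for each character in the pattern
    # (except the last), how far the window may shift on a mismatch.
    m = len(pattern)
    if m == 0:
        return 0
    shift = {}
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i

    n = len(text)
    i = 0
    while i <= n - m:
        # Compare the current window against the pattern from right to left.
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1
        if j < 0:
            return i                                  # full match at position i
        # Skip ahead based on the character under the window's last position.
        i += shift.get(text[i + m - 1], m)
    return -1

# e.g. horspool_find('ab  cd    ef', '    ') -> 6 (the run of four spaces)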

1 Comment

+1 for linking to Boyer-Moore...not like you need the rep, though. :)
