2

I have next 2 blocks of code:

import re
import time

def replace_re(text):
    # Strip newlines and 4-character whitespace runs using a regular expression.
    start = time.time()
    new_text = re.compile(r'(\n|\s{4})').sub('', text)
    finish = time.time()
    return finish - start

def replace_builtin(text):
    # Strip newlines and 4-space runs using str.replace().
    start = time.time()
    new_text = text.replace('\n', '').replace('    ', '')
    finish = time.time()
    return finish - start

Then I call both functions with a text parameter (~500 KB of source code from one web page). I thought replace_re() would be much faster, but the results are:

  1. replace_builtin() ~ 0.008 sec
  2. replace_re() ~ 0.035 sec (nearly 4.5 times slower!!!)

Why is that?
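
For reference, a minimal driver that reproduces this call pattern might look like the following; page.html is just a placeholder for wherever the saved page source lives, and the two functions above are assumed to be defined in the same script:

# Load the saved page source (the path is only a placeholder).
with open('page.html') as f:
    text = f.read()

print('replace_builtin: %.3f sec' % replace_builtin(text))
print('replace_re:      %.3f sec' % replace_re(text))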

2 Comments

  • Why wouldn't the regex be slower? A regular expression engine is much more expensive than iterating over a string and checking for matches.
  • +1 for profiling, and for showing your code and results. It's a legitimate question, even if many responses are treating it as common knowledge.

4 Answers

6

Because regular expressions are more than 4.5 times more complex than a fixed string replacement.


3 Comments

This may be true, but you didn't provide any supporting information for the statement. That's not really very useful, and doesn't answer the OP's question in a meaningful way.
I don't see how this is helpful. @VitaliPonomar, why is this a nice answer? Simply because it's almost common sense?
@zinking I meant it as a funny answer. As you can see, I've accepted another answer as the most helpful...
4

Because a regular expression has to be compiled into a finite-state machine (FSM), which is then used to process the string, while replace() can use the underlying string-processing functions that sit closer to the library/OS level.
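
As a rough illustration (timings are machine-dependent, and the synthetic input below is only a stand-in for a real page), the sketch below pays the compilation cost once, outside the timed calls; even then, the compiled pattern's sub() tends to lag behind str.replace(), which is the per-character engine cost described above:

import re
import timeit

# Synthetic stand-in for a large page: lines of text followed by 4-space indents.
text = ('x' * 70 + '\n' + '    ') * 5000

pattern = re.compile(r'(\n|\s{4})')   # compiled once, reused for every timed call

print(timeit.timeit(lambda: pattern.sub('', text), number=10))
print(timeit.timeit(lambda: text.replace('\n', '').replace('    ', ''), number=10))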

Comments

3

Rule of Thumb

Fixed-string matches should always be faster than regular expression matches. You can Google for various benchmarks, or do what you did and perform your own, but in general you can just assume that this will be true except (possibly) in some unusual edge cases.
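
A quick way to check the rule of thumb yourself, using a different operation (searching rather than replacing) so it is not tied to the question's exact code; the haystack and needle values here are arbitrary and the timings will vary:

import re
import timeit

haystack = 'a' * 10**6 + 'needle'   # arbitrary ~1 MB input with the match at the end

# Fixed-string search vs. the equivalent regular-expression search on the same input.
print(timeit.timeit(lambda: 'needle' in haystack, number=100))
print(timeit.timeit(lambda: re.search('needle', haystack), number=100))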

Why Regular Expressions are Slower

This is true because fixed-string matching involves no backtracking, compilation step, ranges, character classes, or the host of other features that slow down a regular expression engine. There are certainly ways to optimize regex matches, but it's unlikely to beat simply indexing into a string in the common case.

Use the Source

If you want a better explanation, you could always look at the source code for the relevant modules to see how they do what they do. That would certainly give you more information about why any particular implementation performs as it does.

Comments

3

Consider what happens in your source string for every single space character -- the regular expression engine must explore the next three characters to see if they are also spaces. So every space essentially involves looking ahead at the following characters as well. Most RE engines aren't written quite so ... bluntly ... but it does essentially mean that the engine must return to the start state and potentially throw away the partial-match state it had built up.

However, searching a string for a fixed pattern (four spaces) can actually be done in sub-linear time (e.g. with Boyer-Moore). I don't know whether Python's replace() is implemented in this fashion or not, but one hopes they've used the tools at their disposal well.
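
For the curious, here is a bare-bones sketch of the Boyer-Moore-Horspool idea (a simplified Boyer-Moore variant), showing how a fixed pattern lets the search skip ahead instead of examining every position. CPython's str.find()/str.replace() machinery is its own optimized C implementation, so treat this purely as an illustration of the principle rather than the code that actually runs:

def horspool_find(text, pattern):
    # Boyer-Moore-Horspool: precompute, for each character in the pattern
    # (except the last), how far the window may shift on a mismatch.
    m = len(pattern)
    if m == 0:
        return 0
    shift = {}
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i

    n = len(text)
    i = 0
    while i <= n - m:
        # Compare the current window against the pattern from right to left.
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1
        if j < 0:
            return i                                  # full match at position i
        # Skip ahead based on the character under the window's last position.
        i += shift.get(text[i + m - 1], m)
    return -1

# e.g. horspool_find('ab  cd    ef', '    ') -> 6 (the run of four spaces)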

1 Comment

+1 for linking to Boyer-Moore...not like you need the rep, though. :)
