I stumbled upon the following code:

import re 

regex_compiled = re.compile(r'\d{2}-\d{3,5}')

res = re.search(regex_compiled, '12-9876')

I was under the impression that re.search attempts to compile its first parameter. Since that parameter is already compiled here, I expected either an error, or a call to regex_compiled.__repr__() or regex_compiled.__str__() just before a second attempt to compile it.

Just to be sure, I compared it with regex_compiled.search(...):

>>> from timeit import timeit
>>> timeit("import re; regex_compiled = re.compile('\d{2}-\d{3,5}');     res = re.search(regex_compiled, '12-9876')")
1.3797054840251803

>>> timeit("import re; regex_compiled = re.compile('\d{2}-\d{3,5}');     res = regex_compiled.search('12-9876')")
0.7649686150252819
>>>

I am puzzled about where such a substantial difference comes from, given that stepping through re.search in a debugger (in both CPython 2 and 3) shows that the compiled pattern is reused. I hope someone can shed some light on this.

Execution environment: Ubuntu 16.04, 64-bit

    I don't think this is a duplicate. I am fully aware of the caching mechanism, which is exactly why I noticed the substantial difference in timing. That difference bothers me. Commented Feb 20, 2018 at 11:10

1 Answer

re._compile first checks whether the argument is in the cache, and only then whether it is already compiled. So when you pass a compiled pattern to any module-level re function, it spends time computing a cache key and looking it up, even though that key is never going to match. Computing the pattern's repr and doing the OrderedDict lookup are heavy operations, which seems to explain the discrepancy you're observing.

A possible rationale for this behaviour is that _compile is optimized for string patterns, which is its primary use case, and designed to return a cache hit as soon as possible.

Here are some timings:

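from time import time
import re
import sys

print(sys.version)

# Raw string, so the backslashes reach the regex engine unchanged.
pat = r'\d{2}-\d{3,5}'
loops = 1000000

# Start from an empty cache.
re.purge()

# Case 1: string pattern, a cache hit on every call after the first.
t = time()
for _ in range(loops):
    re._compile(pat, 0)
print('compile string  ', time() - t)

re.purge()

# Case 2: already compiled pattern, the cache lookup never matches.
rc = re._compile(pat, 0)
t = time()
for _ in range(loops):
    re._compile(rc, 0)
print('compile compiled', time() - t)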

Results:

$ python3 test.py
3.5.2 (default, Nov 23 2017, 16:37:01) [GCC 5.4.0 20160609]
compile string   0.5387749671936035
compile compiled 0.7378756999969482

$ python2 test.py
2.7.12 (default, Nov 20 2017, 18:23:56) [GCC 5.4.0 20160609]
('compile string  ', 0.5074479579925537)
('compile compiled', 1.3561439514160156)
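To compare the two call styles from the question without also timing the import and the compilation, something like the following sketch could be used. The helper names via_module and via_method are made up for illustration, and the absolute numbers will depend on the interpreter version and machine, so no expected output is shown.

```python
import re
from timeit import timeit

# Compile once, outside the timed code.
rc = re.compile(r'\d{2}-\d{3,5}')
text = '12-9876'

def via_module():
    # Goes through the module-level function, which calls re._compile first.
    return re.search(rc, text)

def via_method():
    # Calls the compiled pattern's method directly.
    return rc.search(text)

loops = 100000
t_module = timeit(via_module, number=loops)
t_method = timeit(via_method, number=loops)
print('re.search(rc, ...)', t_module)
print('rc.search(...)    ', t_method)
```

Both functions return the same match; only the path taken to reach the pattern's search method differs.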