7

I am presently writing a Python script to process some 10,000 input documents. Judging by the script's progress output, I notice that the first 400+ documents are processed really fast, after which the script slows down, even though the input documents are all approximately the same size.

I am assuming this may have to do with the fact that most of the document processing is done with regexes whose compiled regex objects I do not save. Instead, I recompile the regexes whenever I need them.

Since my script has about 10 different functions, each of which uses about 10-20 different regex patterns, I am wondering what would be a more efficient way in Python to avoid recompiling the regex patterns over and over again (in Perl I could simply add the /o modifier).

My assumption is that if I store the regex objects in the individual functions using

pattern = re.compile()

the resulting regex object will not be retained across invocations, so the pattern is recompiled each time the function is called for the next document (each function is called but once per document).
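
For illustration, the function-local placement I mean looks something like this (process_doc and the pattern are just placeholders):

import re

def process_doc(text):
    # 'pattern' is a local name that is rebound on every call,
    # so the re.compile() call is executed once per document.
    pattern = re.compile(r'\w+')
    return pattern.findall(text)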

Creating a global list of pre-compiled regexes seems an unattractive option, since I would need to store the list of regexes in a different location in my code from where they are actually used.

Any advice here on how to handle this neatly and efficiently?

5 Comments

  • No, it has to do with the fact that your cache is depleted. Commented Mar 28, 2012 at 19:45
  • Are all functions applied to all documents? Because if so, @larsmans' answer, while good, does not seem to explain the slowdown after 400 documents. I would suggest profiling rather than guessing... Commented Mar 28, 2012 at 20:07
  • Have you checked how much memory you are using? Commented Mar 28, 2012 at 20:09
  • Sorry, I am not familiar with profiling... how does it work and what does it do for me? Commented Mar 30, 2012 at 20:06
  • Profiling: docs.python.org/library/profile.html Commented Mar 30, 2012 at 21:38

4 Answers

10

The re module caches compiled regex patterns. The cache is emptied completely when it reaches a size of re._MAXCACHE, which is 100 by default. Since you have 10 functions with 10-20 regexes each (i.e. 100-200 distinct regexes), the slowdown you observe is consistent with the cache being repeatedly cleared.

If you are okay with changing private variables, a quick and dirty fix to your program might be to set re._MAXCACHE to a higher value:

import re
re._MAXCACHE = 1000  # default is 100; raise it above your number of distinct patterns

5

Last time I looked, re.compile maintained a rather small cache, and when it filled up, it simply emptied it. A DIY cache with no limit:

import re

class MyRECache(object):
    def __init__(self):
        self.cache = {}  # maps pattern strings to compiled regex objects
    def compile(self, regex_string):
        # Compile each distinct pattern only once; afterwards return the cached object.
        if regex_string not in self.cache:
            self.cache[regex_string] = re.compile(regex_string)
        return self.cache[regex_string]
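
Usage would then look something like this (documents standing in for whatever iterable you process):

recache = MyRECache()
for text in documents:
    recache.compile(r'\d+').findall(text)  # compiled once, reused on every later call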

4 Comments

@SvenMarnach: The code that I wrote can be understood without the need to look up the __voodoo__ docs.
It would be interesting to know how the cache is cleared when its capacity is used up ... are all entries flushed or just a few?
@Pat: If you don't believe that "emptied" means "flushed all entries", find re.py in your Python installation (mine is C:\Python27\Lib\re.py) and look for occurrences of _cache ... you should find _cache = {} and _cache.clear()
2

Compiled regular expressions are automatically cached by re.compile, re.search and re.match, but the maximum cache size is 100 in Python 2.7, so you're overflowing the cache.

"Creating a global list of pre-compiled regexes seems an unattractive option, since I would need to store the list of regexes in a different location in my code from where they are actually used."

You can define them near the place where they are used: just before the functions that use them. And if you reuse the same RE in a different place, it would be a good idea to define it globally anyway, to avoid having to modify it in multiple places.
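
For example (the names are only illustrative):

import re

# Compiled once at import time, defined right above the function that uses it.
_WORD_RE = re.compile(r'\w+')

def count_words(text):
    return len(_WORD_RE.findall(text))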

1

In the spirit of "simple is better" I'd use a little helper function like this:

import re

def rc(pattern, flags=0):
    # Cache key includes the flags, since the same pattern can be
    # compiled differently; compile only on the first request.
    key = pattern, flags
    if key not in rc.cache:
        rc.cache[key] = re.compile(pattern, flags)
    return rc.cache[key]

rc.cache = {}  # a function attribute, so no separate global name is needed

Usage:

rc('[a-z]').sub(...)      # compiled on the first call
rc('[a-z]').findall(...)  # no compilation here; the cached object is reused
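
On Python 3.2+ you can get the same memoization from the standard library instead of a hand-rolled cache; a sketch using functools.lru_cache:

import functools
import re

@functools.lru_cache(maxsize=None)  # unbounded cache of compiled patterns
def rc(pattern, flags=0):
    return re.compile(pattern, flags)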

I also recommend trying the third-party regex module. Among its many other advantages over the stock re, its MAXCACHE is 500 by default, and the cache won't get dropped completely on overflow.
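
Since regex aims to be a drop-in replacement for re, trying it can be as simple as this (assuming it is installed, e.g. via pip install regex):

import regex as re  # the rest of the code keeps calling re.compile() etc.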

1 Comment

Thanks to everyone who bothered to reply to my query. I will follow up on the many helpful pointers. Your support is much appreciated!
