7

I am presently writing a Python script to process some 10,000 input documents. Judging by the script's progress output, I notice that the first 400+ documents are processed really fast, after which the script slows down, even though the input documents are all approximately the same size.

I am assuming this may have to do with the fact that most of the document processing is done with regexes whose compiled regex objects I do not save. Instead, I recompile the regexes whenever I need them.

Since my script has about 10 different functions, each of which uses about 10-20 different regex patterns, I am wondering what would be a more efficient way in Python to avoid recompiling the regex patterns over and over again (in Perl I could simply add the /o modifier).

My assumption is that if I store the regex objects in the individual functions using

pattern = re.compile()

the resulting regex object will not be retained across invocations, so the pattern is recompiled each time the function is called for the next document (each function is called but once per document).
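
For illustration, the function-local placement I mean looks something like this (process_doc and the pattern are just placeholders):

import re

def process_doc(text):
    # 'pattern' is a local name that is rebound on every call,
    # so the re.compile() call is executed once per document.
    pattern = re.compile(r'\w+')
    return pattern.findall(text)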

Creating a global list of pre-compiled regexes seems an unattractive option, since I would need to store the list of regexes in a different location in my code from where they are actually used.

Any advice here on how to handle this neatly and efficiently?

5 Comments

  • No, it has to do with the fact that your cache is depleted. Commented Mar 28, 2012 at 19:45
  • Are all functions applied to all documents? Because if so, @larsmans' answer, while good, does not seem to explain the slowdown after 400 documents. I would suggest profiling rather than guessing... Commented Mar 28, 2012 at 20:07
  • Have you checked how much memory you are using? Commented Mar 28, 2012 at 20:09
  • Sorry, I am not familiar with profiling... how does it work and what does it do for me? Commented Mar 30, 2012 at 20:06
  • Profiling: docs.python.org/library/profile.html Commented Mar 30, 2012 at 21:38

4 Answers

10

The re module caches compiled regex patterns. The cache is emptied completely when it reaches a size of re._MAXCACHE, which is 100 by default. Since you have 10 functions with 10-20 regexes each (i.e. 100-200 distinct regexes), the slowdown you observe is consistent with the cache being repeatedly cleared.

If you are okay with changing private variables, a quick and dirty fix to your program might be to set re._MAXCACHE to a higher value:

import re
re._MAXCACHE = 1000  # default is 100; raise it above your number of distinct patterns

5

Last time I looked, re.compile maintained a rather small cache, and when it filled up, it simply emptied it. A DIY cache with no limit:

import re

class MyRECache(object):
    def __init__(self):
        self.cache = {}  # maps pattern strings to compiled regex objects
    def compile(self, regex_string):
        # Compile each distinct pattern only once; afterwards return the cached object.
        if regex_string not in self.cache:
            self.cache[regex_string] = re.compile(regex_string)
        return self.cache[regex_string]
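
Usage would then look something like this (documents standing in for whatever iterable you process):

recache = MyRECache()
for text in documents:
    recache.compile(r'\d+').findall(text)  # compiled once, reused on every later call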

4 Comments

@SvenMarnach: The code that I wrote can be understood without the need to look up the __voodoo__ docs.
It would be interesting to know how the cache is cleared when its capacity is used up ... are all entries flushed or just a few?
@Pat: If you don't believe that "emptied" means "flushed all entries", find re.py in your Python installation (mine is C:\Python27\Lib\re.py) and look for occurrences of _cache ... you should find _cache = {} and _cache.clear()
2

Compiled regular expressions are automatically cached by re.compile, re.search and re.match, but the maximum cache size is 100 in Python 2.7, so you're overflowing the cache.

"Creating a global list of pre-compiled regexes seems an unattractive option, since I would need to store the list of regexes in a different location in my code from where they are actually used."

You can define them near the place where they are used: just before the functions that use them. And if you reuse the same RE in a different place, it would be a good idea to define it globally anyway, to avoid having to modify it in multiple places.
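
For example (the names are only illustrative):

import re

# Compiled once at import time, defined right above the function that uses it.
_WORD_RE = re.compile(r'\w+')

def count_words(text):
    return len(_WORD_RE.findall(text))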

1

In the spirit of "simple is better" I'd use a little helper function like this:

import re

def rc(pattern, flags=0):
    # Cache key includes the flags, since the same pattern can be
    # compiled differently; compile only on the first request.
    key = pattern, flags
    if key not in rc.cache:
        rc.cache[key] = re.compile(pattern, flags)
    return rc.cache[key]

rc.cache = {}  # a function attribute, so no separate global name is needed

Usage:

rc('[a-z]').sub(...)      # compiled on the first call
rc('[a-z]').findall(...)  # no compilation here; the cached object is reused
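
On Python 3.2+ you can get the same memoization from the standard library instead of a hand-rolled cache; a sketch using functools.lru_cache:

import functools
import re

@functools.lru_cache(maxsize=None)  # unbounded cache of compiled patterns
def rc(pattern, flags=0):
    return re.compile(pattern, flags)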

I also recommend trying the third-party regex module. Among its many other advantages over the stock re, its MAXCACHE is 500 by default, and the cache won't get dropped completely on overflow.
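
Since regex aims to be a drop-in replacement for re, trying it can be as simple as this (assuming it is installed, e.g. via pip install regex):

import regex as re  # the rest of the code keeps calling re.compile() etc.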

1 Comment

Thanks to everyone who bothered to reply to my query. I will follow up on the many helpful pointers. Your support is much appreciated!
