I am currently writing a Python script to process roughly 10,000 input documents. Based on the script's progress output, I notice that the first 400+ documents are processed very quickly, after which the script slows down, even though the input documents are all approximately the same size.
I suspect the slowdown has to do with the fact that most of the document processing is done with regexes that I never store as compiled regex objects; instead, I recompile them whenever I need them.
Since my script has about 10 different functions, each of which uses 10-20 different regex patterns, I am wondering what a more efficient way would be in Python to avoid recompiling the patterns over and over again (in Perl I could simply use the /o modifier).
My assumption is that if I store a regex object locally inside each function, e.g.

    pattern = re.compile(r"...")

the compiled object will not be retained until the function is invoked again for the next document (each function is called only once per document).
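For context, here is a minimal sketch of the shape my functions currently have (the function name and pattern are made up for illustration):

    import re

    def extract_dates(text):
        # The pattern is recompiled on every call, i.e. once per document.
        pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
        return pattern.findall(text)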
Creating a global list of pre-compiled regexes seems unattractive, since I would have to keep the patterns in a different place in my code from where they are actually used.
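To make that concern concrete, this is the kind of layout I would like to avoid (function and pattern names are hypothetical):

    import re

    # All patterns collected in one place at the top of the file...
    DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
    PRICE_RE = re.compile(r"\$\d+(?:\.\d{2})?")

    # ...while the functions that use them live far below.
    def extract_dates(text):
        return DATE_RE.findall(text)

    def extract_prices(text):
        return PRICE_RE.findall(text)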
Any advice here on how to handle this neatly and efficiently?
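For instance, would it be considered idiomatic to define each compiled pattern at module level immediately above the function that uses it, so that the pattern stays next to its usage but is compiled only once at import time? A rough sketch of what I mean (names invented):

    import re

    # Compiled once at import time, defined right next to the function that uses it.
    _DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

    def extract_dates(text):
        return _DATE_RE.findall(text)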