I iterate over the lines in a large number of downloaded text files and do a regex match on each line. Usually, the match takes less than a second. However, at times a match takes several minutes, sometimes the match does not finish at all and the code just hangs (waited an hour a couple of times, then gave up). Therefore, I need to introduce some kind of timeout and tell the regex match code in some way to stop after 10 seconds or so. I can live with the fact that I will lose the data the regex was supposed to return.
I tried the following (which of course is already 2 different, thread-based solutions shown in one code sample):
def timeout_handler():
print 'timeout_handler called'
if __name__ == '__main__':
timer_thread = Timer(8.0, timeout_handler)
parse_thread = Thread(target=parse_data_files, args=(my_args))
timer_thread.start()
parse_thread.start()
parse_thread.join(12.0)
print 'do we ever get here ?'
but I do neither get the timeout_handler called nor the do we ever get here ? line in the output, the code is just stuck in parse_data_files.
Even worse, I can't even stop the program with CTRL-C, instead I need to look up the python process number and kill that process. Some research showed that the Python guys are aware of regex C code running away: http://bugs.python.org/issue846388
I did achieve some success using signals:
signal(SIGALRM, timeout_handler)
alarm(8)
data_sets = parse_data_files(config(), data_provider)
alarm(0)
this gets me the timeout_handler called line in the output - and I can still stop my script using CTRL-C. If I now modify the timeout_handler like this:
class TimeoutException(Exception):
pass
def timeout_handler(signum, frame):
raise TimeoutException()
and enclose the actual call to re.match(...) in a try ... except TimeoutException clause, the regex match actually does get interrupted. Unfortunately, this only works in my simple, single-threaded sandbox script I'm using to try out stuff. There is a few things wrong with this solution:
- the signal triggers only once, if there is more than one problematic line, I'm stuck on the second one
- the timer starts counting right there, not when the actual parsing starts
- because of the GIL, I have to do all the signal setup in the main thread and signals are only received in the main thread; this clashes with the fact that multiple files are meant to be parsed simultaneously in separate threads - there is also only one global timeout exception raised and I don't see how to know in which thread I need to react to it
- I've read several times now that threads and signals do not mix very well
I have also considered doing the regex match in a separate process, but before I get into that, I thought I'd better check here if anyone has come across this problem before and could give me some hints on how to solve it.
Update
the regex looks like this (well, one of them anyway, the problem occurs with other regexes, too; this is the simplest one):
'^(\d{5}), .+?, (\d{8}), (\d{4}), .+?, .+?,' + 37 * ' (.*?),' + ' (.*?)$'
sample data:
95756, "KURN ", 20110311, 2130, -34.00, 151.21, 260, 06.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -
As said, the regex usually performs ok - I can parse several hundreds of files with several hundreds of lines in less than a minute. That's when the files are complete, though - the code seems to hang with files that have incomplete lines, such as e.g.
`95142, "YMGD ", 20110311, 1700, -12.06, 134.23, 310, 05.0, 25.8, 23.7, 1004.7, 20.6, 0.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999
I do also get cases where the regex seems to return right away and reports a non-match.
Update 2
I have only quickly read through the catastrophic article, but as far as I can tell so far, that's not the cause - I do not nest any repetition operators.
I'm on Mac OSX, so I can't use RegexBuddy to analyze my regex. I tried RegExhibit (which apparently uses a Perl RegEx engine internally) - and that runs away, too.