Code that I'm running using Python's multiprocessing module hangs with no warnings or errors. I think I've narrowed it down to when plots are generated. Is there some incompatibility between multiprocessing and matplotlib?
I'm preprocessing a large number of datasets in Python (using numpy, scipy, pandas). Each dataset is made up of a collection of separate data files. I read in the raw data and write one .pkl file and a handful of .png files for each dataset. Plots are generated using matplotlib and seaborn. Figures are saved to file without being displayed. The preprocessing for each dataset should be completely independent of each other.
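For context, the per-dataset plotting follows roughly this pattern. This is a simplified sketch, not my real code: save_waveform_plot and the signal names are placeholders, the seaborn styling is omitted, and I force the non-interactive Agg backend here only so the snippet runs headless.

```python
import os

import numpy as np
import matplotlib
matplotlib.use('Agg')  # save figures to file without displaying them
import matplotlib.pyplot as plt

def save_waveform_plot(t, waveform, outdir, dataset):
    """Plot one waveform and save it as <outdir>/<dataset>_waveform.png."""
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(t, waveform)
    ax.set_xlabel('Time (s)')
    ax.set_ylabel('Amplitude')
    outfile = os.path.join(outdir, '{}_waveform.png'.format(dataset))
    fig.savefig(outfile, dpi=150)
    plt.close(fig)  # release the figure's memory between datasets
    return outfile

t = np.linspace(0.0, 1.0, 1000)
outfile = save_waveform_plot(t, np.sin(2 * np.pi * 5.0 * t), '.', 'data0')
```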
Processing serially works. preprocess.main_debug() takes in path/filename/flags and returns a status string ('complete', 'skipped', etc.):
import preprocess
# Serial processing
dataroot = '/Volumes/ExtData/'
study = 'study0'
datasets = ['data0', 'data1', 'data2']
force_preprocess = True
quiet_console = False
status = [preprocess.main_debug(dataroot, study, dataset,
                                force_preprocess, quiet_console)
          for dataset in datasets]

# Print summary
print('\n---- Summary --------------')
for d, s in zip(datasets, status):
    print('  {}:\t{}'.format(d, s))
But multiprocessing hangs:
import multiprocessing as mp
import logging
import preprocess
dataroot = '/Volumes/ExtData/'
study = 'study0'
datasets = ['data0', 'data1', 'data2']
force_preprocess = True
quiet_console = True # Suppress console output
# Send multiprocessing logs to console
mp.log_to_stderr()
logger = mp.get_logger()
logger.setLevel(logging.INFO)
# Parallel process
pool = mp.Pool(processes=3, maxtasksperchild=1)
results = [pool.apply_async(preprocess.main_debug,
                            args=(dataroot, study, dataset,
                                  force_preprocess, quiet_console))
           for dataset in datasets]
status = [p.get(timeout=None) for p in results]

# Print summary
print('\n---- Summary --------------')
for d, s in zip(datasets, status):
    print('  {}:\t{}'.format(d, s))
I've fiddled around with the number of processes, maxtasksperchild, and timeout to no effect. I found some links online indicating that there might be some incompatibility between logging and multiprocessing, so I removed all the logging code, but execution hangs in the same way.
When I run the multiprocessing version of the code, I see this in the console.
$ python batchpreprocess.py
[INFO/PoolWorker-1] child process calling self.run()
[INFO/PoolWorker-2] child process calling self.run()
[INFO/PoolWorker-3] child process calling self.run()
After about 7 minutes, CPU usage drops from 100% to 0% and memory usage drops from ~12 GB to ~3 MB. I then see 3 more child processes start, and things stay stuck in that state (overnight, at least). That seems strange to me: I'm only testing with 3 datasets, so I expected 3 child processes total (unless maxtasksperchild=1 makes the pool spawn a replacement worker after each task?).
$ python batchpreprocess.py
[INFO/PoolWorker-1] child process calling self.run()
[INFO/PoolWorker-2] child process calling self.run()
[INFO/PoolWorker-3] child process calling self.run()
[INFO/PoolWorker-4] child process calling self.run()
[INFO/PoolWorker-5] child process calling self.run()
[INFO/PoolWorker-6] child process calling self.run()
I sprinkled my code with logging statements. It hangs at the point where I have plotting code that generates a plot of the waveforms. If I remove that plotting code, execution continues past that point, but then it hangs at the next plot.
The contents of preprocess.main_debug() look like this (I've swapped the bare except for except Exception and added a traceback print while debugging, so exceptions in the workers aren't silently swallowed):

import traceback

def main_debug(dataroot, study, dataset, force_preprocess, quiet_console):
    try:
        status = main(dataroot, study, dataset,
                      force_preprocess, quiet_console)
        return status
    except Exception:
        traceback.print_exc()
        print('Problem in dataset {}'.format(dataset))
        return 'Exception'

def main(dataroot, study, dataset, force_preprocess, quiet_console):
    ...
    [load files, do signal processing, make plots, save .pkl file]
    ...
    return 'Done'
I need to have plots made as part of the preprocessing. (Plotting from the saved pkl files is possible, but would require re-executing most of the code.) I'm hoping someone else has run across something similar and knows a work-around.
Thanks,
Derek
Python 2.7, OSX High Sierra, just updated all my packages using anaconda.
Comment: Can you post preprocess.main_debug? Or at least the parameters and the initial setup of variables in the function?