Code that I'm running using Python's multiprocessing module hangs with no warnings or errors. I think I've narrowed it down to when plots are generated. Is there some incompatibility between multiprocessing and matplotlib?

I'm preprocessing a large number of datasets in Python (using numpy, scipy, pandas). Each dataset is made up of a collection of separate data files. I read in the raw data and write one .pkl file and a handful of .png files for each dataset. Plots are generated using matplotlib and seaborn. Figures are saved to file without being displayed. The preprocessing for each dataset should be completely independent of the others.

Processing serially works. preprocess.main_debug() takes in path/filename/flags and returns a status string ('complete', 'skipped', etc.):

import preprocess

# Serial processing
dataroot = '/Volumes/ExtData/'
study = 'study0'
datasets = ['data0', 'data1', 'data2']
force_preprocess = True
quiet_console = False

status = [preprocess.main_debug(dataroot, study, dataset,
                                force_preprocess, quiet_console)
          for dataset in datasets]

# Print summary
print('\n---- Summary --------------')
for d, s in zip(datasets, status):
    print(' {}:\t{}'.format(d, s))

But multiprocessing hangs:

import multiprocessing as mp
import logging
import preprocess

dataroot = '/Volumes/ExtData/'
study = 'study0'
datasets = ['data0', 'data1', 'data2']
force_preprocess = True
quiet_console = True  # Suppress console output

# Send multiprocessing logs to console
mp.log_to_stderr()
logger = mp.get_logger()
logger.setLevel(logging.INFO)

# Parallel process
pool = mp.Pool(processes=3, maxtasksperchild=1)
results = [pool.apply_async(preprocess.main_debug,
                            args=(dataroot, study, dataset,
                                  force_preprocess, quiet_console))
           for dataset in datasets]
status = [p.get(timeout=None) for p in results]

# Print summary
print('\n---- Summary --------------')
for d, s in zip(datasets, status):
    print(' {}:\t{}'.format(d, s))

I've fiddled around with the number of processes, maxtasksperchild, and timeout to no effect. I found some links online indicating that there might be some incompatibility between logging and multiprocessing, so I removed all the logging code, but execution hangs in the same way.
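One way to keep a stuck worker from blocking the summary indefinitely is to give `get()` a finite timeout and record which datasets never returned. The following is a minimal sketch, not part of the original script: `collect_status` and the `'timed out'` label are hypothetical names, and the timeout value is an assumption that would need tuning to the real workload.

```python
import multiprocessing as mp


def collect_status(results, timeout_s):
    """Return one status per AsyncResult, marking those that never finish.

    `results` is a list of objects returned by pool.apply_async().
    Each get() call waits at most `timeout_s` seconds.
    """
    statuses = []
    for result in results:
        try:
            statuses.append(result.get(timeout=timeout_s))
        except mp.TimeoutError:
            # The worker did not return in time; report it instead of hanging.
            statuses.append('timed out')
    return statuses
```

In the script above, `status = collect_status(results, 1800)` could stand in for the list comprehension that calls `p.get(timeout=None)`. This only makes the hang visible; the stuck worker processes would still need to be stopped, for example with `pool.terminate()`.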

When I run the multiprocessing version of the code, I see this in the console.

$ python batchpreprocess.py 
[INFO/PoolWorker-1] child process calling self.run()
[INFO/PoolWorker-2] child process calling self.run()
[INFO/PoolWorker-3] child process calling self.run()

After 7 minutes or so, the CPU usage drops from 100% to 0% and memory usage drops from ~12GB to ~3MB. I then see that 3 more child processes are started. Things stay stuck in this state (overnight, at least). Seems strange to me since I'm only testing with 3 datasets, so I expected only 3 child processes total.

$ python batchpreprocess.py 
[INFO/PoolWorker-1] child process calling self.run()
[INFO/PoolWorker-2] child process calling self.run()
[INFO/PoolWorker-3] child process calling self.run()
[INFO/PoolWorker-4] child process calling self.run()
[INFO/PoolWorker-5] child process calling self.run()
[INFO/PoolWorker-6] child process calling self.run()  

I sprinkled my code with logging statements. It hangs where I have plotting code that generates a plot of the waveforms. If I remove that plotting code, execution continues past that point, but then it hangs at the next plot.

The contents of preprocess.main_debug() look like this:

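import traceback

def main_debug(dataroot, study, dataset, force_preprocess, quiet_console):
    """Run main() for one dataset and return its status string."""
    try:
        status = main(dataroot, study, dataset,
                      force_preprocess, quiet_console)
        return status
    except Exception:
        # Print the traceback too, so the cause of a failure is not hidden
        print('Problem in dataset {}'.format(dataset))
        traceback.print_exc()
        return 'Exception'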

def main(dataroot, study, dataset, force_preprocess, quiet_console):
    ...
    [load files, do signal processing, make plots, save .pkl file]
    ...
    return 'Done'

I need to have plots made as part of the preprocessing. (Plotting from the saved pkl files is possible, but would require re-executing most of the code.) I'm hoping someone else has run across something similar and knows a work-around.

Thanks,

Derek

Python 2.7, OSX High Sierra, just updated all my packages using anaconda.

  • I've had issues with matplotlib not ever finishing the graph I was trying to make. It turned out to be that there were just too many data points and it was hanging on that. You should try to truncate your data set and graph just that small portion of it to see if this might be what's happening. Commented Dec 21, 2017 at 22:34
  • I don't see any plotting code, so are you sure that's where it's hanging? What backend are you using? If it's not Agg, try Agg. Commented Dec 21, 2017 at 23:03
  • It seems like you've run into a deadlock somehow. This can only happen if the processes share resources, and in multiprocessing that's usually less likely to happen than in multithreading. From what I see in your post, everything in the multiprocessing setup seems correct. Can you add the code of preprocess.main_debug? Or at least the parameters and the initial setup of variables in the function? Commented Dec 22, 2017 at 6:02
  • Thanks for all the suggestions. Turns out @PaulH's suggestion was right on. I was using MacOSX backend. I switched to Agg and the script runs to completion. Brilliant! :) Commented Dec 22, 2017 at 7:54
  • BTW, I'm new to posting on the forums... what's the best way to close this out? Commented Dec 22, 2017 at 7:59

1 Answer

If you have matplotlib set to use an interactive backend, the plots will create windows that must be closed before the main loop can continue.

To avoid this, use a non-interactive backend such as "agg".

You can set the parameter in your matplotlibrc file.

You can also do the following before importing pyplot:

import matplotlib
matplotlib.use('agg')
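
To show how this might look inside a worker, here is a minimal sketch. It assumes a hypothetical plotting helper, `save_waveform_plot`, standing in for the plotting code in `preprocess.main()`, which the question does not show. `use_headless_backend` is also a hypothetical name; it relies on the `MPLBACKEND` environment variable, which matplotlib reads when choosing a backend and which child processes inherit. That variable is not mentioned in the original answer and is offered only as an alternative to editing matplotlibrc.

```python
import os


def use_headless_backend():
    """Ask matplotlib to use Agg in this process and in workers started later."""
    # Child processes inherit the parent's environment.
    os.environ["MPLBACKEND"] = "Agg"


def save_waveform_plot(samples, out_path):
    """Hypothetical plotting step: draw `samples` and save them to `out_path`."""
    # matplotlib is imported here so the backend is chosen before pyplot loads.
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(samples)
    fig.savefig(out_path)
    plt.close(fig)  # free the figure; long batch runs can otherwise accumulate them
    return out_path
```

Calling `use_headless_backend()` in the batch script before the pool is created would cover every worker, while the `matplotlib.use("Agg")` call inside the helper covers the case where the module is imported on its own.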

1 Comment

Thanks, Paul. I never called show() in my scripts, so I didn't expect any windows to pop up. But I did see something in the documentation about macosx being anomalous. "Cocoa rendering in OSX windows (presently lacks blocking show() behavior when matplotlib is in non-interactive mode)". Maybe that's related. Thanks again for the help.
