
I am using the multiprocessing module in Python and expect some overhead in launching a process, creating a queue, and putting values on and getting them off the queue. However, if the sub-process has enough work to do, I would expect the overhead eventually to be washed out. Running a simple example (explained below), the runtime in a spawned process is about 10 times that of the same computation run in the parent process, even for very large jobs.

In the following code, I compute the mean of a series of larger and larger arrays. I compare calling numpy.mean from the parent process to calling the same mean function from a single spawned process and to doing nothing in a spawned process (to get an idea of overhead cost).

Initially, the results are as I expect. The total runtime is much faster when mean is called from the parent process than when called from a spawned process. For small jobs, the runtime for the spawned process is dominated by the overhead.

What is surprising, however, is that for larger jobs, the runtime for the spawned process consistently exceeds the cost of calling from the parent process by about a factor of 10.

Can anyone provide an explanation for this? Is this due to memory limitations in the sub-process? The largest arrays I test are 128 MB, 512 MB, and 2 GB (2**24, 2**26, and 2**28 doubles).
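For a rough yardstick of the serialization cost alone: when arguments are pickled to the child (e.g., with the spawn start method, the Windows default) or values pass through a Queue, the data takes a pickle round trip, and its raw cost can be timed directly. A sketch, not part of the benchmark below; the size and names are illustrative only:

import pickle
import time

import numpy

x = numpy.random.rand(2**24)             # ~128 MB of float64
t0 = time.perf_counter()
buf = pickle.dumps(x)                    # parent-side serialization
t1 = time.perf_counter()
y = pickle.loads(buf)                    # child-side deserialization
t2 = time.perf_counter()
print("pickle: %.3f s, unpickle: %.3f s" % (t1 - t0, t2 - t1))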

Here is the code:

%matplotlib
import numpy, multiprocessing, pandas

def do_nothing(x, q):
    # Touch the array but do no real work; return one element via the queue.
    q.put(x[-1])

def my_mean(x, q):
    # Compute the mean in the sub-process; return the scalar via the queue.
    q.put(numpy.mean(x))

def test_mp(f, x):
    # Run f(x, q) in a spawned process and collect its single scalar result.
    # (Joining before q.get() is safe here only because the queued item is
    # a scalar; large items would block the child's queue feeder thread.)
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=f, args=(x, q))
    p.start()
    p.join()
    s = q.get()
    return s

ndata = 2**numpy.arange(10, 29, 2)    # array lengths from 2**10 to 2**28
tr1, tr2, tr3 = [], [], []
for n in ndata:
    x = numpy.random.rand(n)
    tresults = %timeit -n 1 -r 5 -o -q test_mp(do_nothing,x)
    tr1.append(tresults)

    tresults = %timeit -n 1 -r 5 -o -q test_mp(my_mean,x)
    tr2.append(tresults)

    tresults = %timeit -n 1 -r 5 -o -q numpy.mean(x)
    tr3.append(tresults)

print("All done")

# Best-of-5 times, converted from seconds to milliseconds.
t1, t2, t3 = map(lambda tr: pandas.Series([1000*t.best for t in tr]), [tr1, tr2, tr3])
df = pandas.DataFrame({'n' : ndata, 't1 (do nothing)' : t1,
                       't2 (my_mean)' : t2,
                       't3 (mean)'    : t3})
display(df)
df.plot(x='n', style='.-', markersize=10, logx=True, logy=True)

Here are the results. All timing results are in milliseconds.

[Image: table of timing results]

[Image: log-log plot of t1, t2, t3 vs. n]

  • Each process has its own memory space. Queue transfers data between processes by pickling it (in the first process's memory space) and unpickling it again (in the second process's memory space). For huge data structures this has a correspondingly huge overhead. If possible you need to initialize the data and crunch it within the same process. There are also facilities in the multiprocessing package (e.g., multiprocessing.Array) for sharing some types of data across processes; a sketch follows these comments. Commented Jan 20, 2018 at 0:48
  • What surprises me is the timing of do_nothing: since you are not multithreading, the data must be communicated from one process to the other, which is an overhead proportional to n. If I repeat the experiment, I get a different result, showing a slight increase in do_nothing at row 8, and then my PC fills its RAM. Commented Jan 20, 2018 at 0:50
  • @Matteo T. I tried to set up do_nothing so that any overhead associated with passing the array to the subprocess was taken into account (notice I put the last entry on the queue). So unless the Python interpreter is really smart and recognizes that I am not using x at all in do_nothing, I don't see how the pickling cost is reflected. Commented Jan 20, 2018 at 0:52
  • Could you try q.put(x[int(numpy.random.uniform(0,len(x)-1))]) instead of q.put(x[-1])? This is still nothing for the CPU, but Python will never predict it ;) Commented Jan 20, 2018 at 0:55
  • Just tried that and I get essentially the same results. Commented Jan 20, 2018 at 0:59
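A minimal sketch of the shared-memory approach suggested in the first comment, using multiprocessing.Array; the frombuffer wrapping and the name mean_shared are illustrative, not from the original post:

import multiprocessing

import numpy

def mean_shared(shared_arr, q):
    # Re-wrap the shared buffer as a numpy array: no copy, no pickling of the data.
    x = numpy.frombuffer(shared_arr.get_obj())
    q.put(numpy.mean(x))

if __name__ == '__main__':
    n = 2**20
    shared_arr = multiprocessing.Array('d', n)    # 'd' = C double, lock included
    x = numpy.frombuffer(shared_arr.get_obj())    # parent's view of the same memory
    x[:] = numpy.random.rand(n)                   # fill in the parent

    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=mean_shared, args=(shared_arr, q))
    p.start()
    s = q.get()                                   # drain the queue before joining
    p.join()
    print(s)

Here the child re-wraps the shared buffer instead of receiving a pickled copy, so only the scalar result crosses the process boundary.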

1 Answer


Here are a few observations about what is going on:

  • Using the -c flag with %timeit shows CPU time for the parent process only, and in this case my_mean and do_nothing show essentially the same flat times. So the parent process is using the same CPU time in both cases.

  • Without the -c flag, wall-clock time is measured, and the time spent pickling is then accounted for, at least in the call to my_mean. Why it is not accounted for in do_nothing is still a mystery. Is the Python interpreter smart enough to recognize that do_nothing really does nothing?

  • Something else that is a bit of a mystery: who is doing the pickling? The parent process? If so, it doesn't use any CPU time, so it must be the spawned process.

  • Using the threading module (with queue.Queue() and threading.Thread()), the results are much more in line with what is expected: for large enough problems, the run time is dominated by the time it takes to compute the mean, and the direct call to numpy.mean() and the same call in a spawned thread take essentially the same time. A sketch of the threaded harness is below.
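The threaded harness is the same as test_mp with queue.Queue and threading.Thread swapped in; a minimal sketch (the name test_thread is illustrative):

import queue
import threading

def test_thread(f, x):
    # Same harness as test_mp, but with a thread: x is shared with the
    # worker directly, so nothing is pickled or copied.
    q = queue.Queue()
    t = threading.Thread(target=f, args=(x, q))
    t.start()
    t.join()
    return q.get()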

Here are the times for the same problem, using the threading module:

[Image: table of timings for the threaded version]

[Image: log-log plot of the threaded timings vs. n]

