
The Python code file is provided below. I'm using Python 3.10.12 on Linux Mint 21.3 (in case any of this info is needed). The version with a pool of 2 workers takes more time than the one without any multiprocessing. What am I doing wrong here?

import multiprocessing
import time
import random

def fun1(x):
    # plain serial sum
    y = 0
    for i in range(len(x)):
        y = y + x[i]
    return y

def fun2(x):
    # split the list in half and sum each part in a separate worker
    p = multiprocessing.Pool(2)
    y1, y2 = p.map(fun1, [x[: len(x) // 2], x[len(x) // 2 :]])
    y = y1 + y2
    return y

x = [random.random() for i in range(10 ** 6)]

st = time.time()
ans = fun1(x)
et = time.time()
print(f"time = {et - st}, answer = {ans}.")

st = time.time()
ans = fun2(x)
et = time.time()
print(f"time = {et - st}, answer = {ans}.")

x = [random.random() for i in range(10 ** 7)]

st = time.time()
ans = fun1(x)
et = time.time()
print(f"time = {et - st}, answer = {ans}.")

st = time.time()
ans = fun2(x)
et = time.time()
print(f"time = {et - st}, answer = {ans}.")

Here is what I get in the terminal.

time = 0.043381452560424805, answer = 499936.40420325665.
time = 0.1324300765991211, answer = 499936.40420325927.
time = 0.4444568157196045, answer = 5000677.883536603.
time = 0.8388040065765381, answer = 5000677.883536343.

I also tried placing an if __name__ == '__main__': guard after fun2 and before the rest of the code; I get the same results in the terminal. I also tried Python 3.6.2 on a Codio server and got similar timings.

time = 0.048882484436035156, answer = 499937.07655266096.
time = 0.15220355987548828, answer = 499937.0765526707.
time = 0.4848289489746094, answer = 4999759.127770024.
time = 1.4035391807556152, answer = 4999759.127769606.

I guess something is wrong with what I'm doing in my code, a misunderstanding of how to use multiprocessing.Pool rather than a problem with Python itself, but I can't think of what. Any help would be appreciated. I expected that using two workers would give a speed-up by a factor of two, not a slowdown. Also, in case it's needed: I checked with multiprocessing.cpu_count(); the Codio server has 4 CPUs and my computer has 12.
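For reference, here is roughly how I placed the guard mentioned above (the function definitions stay at module level so the worker processes can import them; the driver code goes under the guard):

import multiprocessing
import time
import random

def fun1(x):
    y = 0
    for i in range(len(x)):
        y = y + x[i]
    return y

def fun2(x):
    p = multiprocessing.Pool(2)
    y1, y2 = p.map(fun1, [x[: len(x) // 2], x[len(x) // 2 :]])
    return y1 + y2

if __name__ == '__main__':
    # only the parent process runs this block; workers that
    # re-import the module skip it
    x = [random.random() for i in range(10 ** 6)]
    st = time.time()
    ans = fun2(x)
    et = time.time()
    print(f"time = {et - st}, answer = {ans}.")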

Comments:

  • I don't think you're doing anything wrong - but the actual work done in each iteration (1 addition + 1 assignment) is just not expensive enough to outweigh the overhead from splitting and mapping the input and marshalling the extra worker processes. Try adding time.sleep(0.001) (perhaps with a smaller input size) after each iteration in fun1 - suddenly you'll see the pool finishing faster. Commented Nov 27, 2024 at 14:29
  • No, that's not what I'm saying - everything is working as intended in your posted sample - but the cumulative overhead from everything that goes into bootstrapping and managing the pool simply exceeds the trivial cost of the original workload. Splitting the input list into 2 likely contributes to that overhead, although I doubt it's the biggest factor. For the suggested experiment, here's what I meant in terms of changes to fun1: gist.githubusercontent.com/IISResetMe/… Commented Nov 27, 2024 at 14:55
  • sleep() isn't CPU-bound, so using multiple processors in this case won't give you anything other than more scheduling overhead :) For a CPU-bound workload, do something that requires many CPU instructions in quick succession: a simple (= unoptimized) primality test over large integers, multiplying big matrices, long division to approximate the first 10000 digits of phi, etc. (see the sketch after these comments). Commented Nov 27, 2024 at 16:27
  • Multiprocessing is expensive here because all objects sent to workers are serialized (using pickle), then sent sequentially to each worker using an inter-process-communication (IPC) method (generally memory-bound for large datasets), read by each worker (which can easily starve for data to compute), then deserialized (unpickled), before just doing a basic addition. Cost(pickling + IPC + unpickling) > Cost(addition). Commented Nov 27, 2024 at 20:43
  • Multithreading can help avoid this problem, but it is limited by the GIL, which prevents any speed-up for computational operations unless it is disabled. It can be disabled by a C module when no Python objects are manipulated (quite rare). Alternatively, it can be experimentally disabled globally on very recent versions of Python. However, this introduces some additional overhead (e.g., a ~5% slowdown of sequential code) and is not supported by all native modules either. It might cause issues (e.g., crashes) with some modules, so use at your own risk. Commented Nov 27, 2024 at 20:48
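To illustrate the CPU-bound suggestion in the comments above (a hypothetical example, not from the original post): with a fun1 like the following, each item costs many CPU instructions, so the pool's fixed overhead can amortize.

import multiprocessing
import random

def is_prime(n):
    # deliberately unoptimized trial division: lots of CPU work per item
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

def fun1(x):
    # count primes instead of summing floats: the per-item cost now
    # dwarfs the pool's pickling/IPC overhead
    return sum(1 for n in x if is_prime(n))

def fun2(x):
    with multiprocessing.Pool(2) as p:
        return sum(p.map(fun1, [x[: len(x) // 2], x[len(x) // 2 :]]))

if __name__ == "__main__":
    x = [random.randrange(10 ** 6, 10 ** 7) for _ in range(10 ** 4)]
    print(fun1(x), fun2(x))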

2 Answers


I modified your fun1:

import multiprocessing
import time
import random

def fun1(x):
    # heavier per-item work (branching + exponentiation), so the pool's
    # fixed overhead has a chance to amortize
    y = 0
    for i in range(len(x)):
        if 0.3 < x[i] < 0.5:
            y = y + x[i]
        elif x[i] >= 0.5:
            y += x[i] ** 2
        else:
            y += 1
    return y

def fun2(x):
    p = multiprocessing.Pool(2)
    y1, y2 = p.map(fun1, [x[: len(x) // 2], x[len(x) // 2 :]])
    y = y1 + y2
    return y

x = [random.random() for i in range(10 ** 6)]

st = time.time()
ans = fun1(x)
et = time.time()
print(f"time = {et - st}, answer = {ans}.")

st = time.time()
ans = fun2(x)
et = time.time()
print(f"time = {et - st}, answer = {ans}.")

x = [random.random() for i in range(10 ** 7)]

st = time.time()
ans = fun1(x)
et = time.time()
print(f"time = {et - st}, answer = {ans}.")

st = time.time()
ans = fun2(x)
et = time.time()
print(f"time = {et - st}, answer = {ans}.")

Output (it seems reproducible to me; I have a small CPU):

time = 0.5544326305389404, answer = 671883.5322289083.
time = 0.5338046550750732, answer = 671883.5322288958.
time = 4.955313205718994, answer = 6715366.44807873.
time = 3.6769814491271973, answer = 6715366.448078716.

# another try:

time = 0.5013079643249512, answer = 671620.2326088196.
time = 0.39491748809814453, answer = 671620.2326088334.
time = 4.901585578918457, answer = 6717295.294786744.
time = 3.776726484298706, answer = 6717295.294786792.

# next:

time = 0.4863710403442383, answer = 671647.0603678812.
time = 0.6218230724334717, answer = 671647.0603678576.
time = 5.027068376541138, answer = 6716671.888051354.
time = 3.752805471420288, answer = 6716671.888051517.

I think it is related to "Understanding time complexity with Python examples", but I could be wrong. I am trying to learn too.

Addendum

As per a previous good answer that was deleted:

On using multiprocessing.Pool().map, from "Multiprocessing in Python - all about pickling":

Multiprocessing in Python is flexible. You can either define processes and orchestrate them as you wish, or use one of the excellent methods for herding a Pool of processes. By default, Pool assumes the number of processes to be equal to the number of CPU cores, but you can change it by passing the processes parameter. The main methods included in Pool are apply and map, which let you run a process with arbitrary arguments or execute a parallel map, respectively. There are also asynchronous versions of these, i.e. apply_async and map_async. Simple, right? Yes, this is all that's needed. Now, go and use multiprocessing!
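For instance, a minimal sketch of the asynchronous variant (standard multiprocessing API; square is just a placeholder workload):

import multiprocessing

def square(v):
    return v * v

if __name__ == "__main__":
    with multiprocessing.Pool(2) as p:
        # map_async returns an AsyncResult immediately;
        # .get() blocks until all workers have finished
        result = p.map_async(square, range(10))
        print(result.get())  # [0, 1, 4, 9, ..., 81]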

Actually, before you leave to follow your dreams, there's a small caveat. When executing processes, Python first pickles these methods. This creates a bottleneck, as only objects that can be pickled will be passed to processes. Moreover, Pool doesn't allow parallelizing objects that refer to the instance of the pool that runs them.

Yes, this is all because of pickle. What can you do about it? In a sense, not much, as it's the default batteries-included solution. On the other hand, pickle is generally slow.
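As a concrete illustration of the pickling constraint (standard Python behavior, not part of the quoted post): a lambda cannot be pickled, so handing one to Pool.map fails.

import multiprocessing

if __name__ == "__main__":
    with multiprocessing.Pool(2) as p:
        try:
            p.map(lambda v: v + 1, [1, 2, 3])
        except Exception as e:
            # typically a PicklingError: a lambda has no importable name
            print(type(e).__name__, e)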

Deleted answer:

You are copying a considerably large amount of data (>90 MB), so fun2 spends a lot of time on data copying (serialization/deserialization and inter-process communication):

>>> import pickle
>>> import random
>>> data = [random.random() for i in range(10 ** 7)]
>>> len(pickle.dumps(data))
90032371

Instead of passing the large data directly, you can store it in multiple files and distribute the file names to the workers, as sketched below.
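A minimal sketch of that idea (hypothetical helper names; temporary pickle files used for illustration):

import multiprocessing
import os
import pickle
import random
import tempfile

def sum_file(path):
    # each worker loads only its own chunk from disk
    with open(path, "rb") as f:
        return sum(pickle.load(f))

def fun2_files(x, nchunks=2):
    # write the chunks to temporary files and pass only the (tiny) file names
    step = len(x) // nchunks
    paths = []
    for i in range(nchunks):
        chunk = x[i * step :] if i == nchunks - 1 else x[i * step : (i + 1) * step]
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            pickle.dump(chunk, f)
        paths.append(path)
    try:
        with multiprocessing.Pool(nchunks) as p:
            return sum(p.map(sum_file, paths))
    finally:
        for path in paths:
            os.remove(path)

if __name__ == "__main__":
    x = [random.random() for _ in range(10 ** 6)]
    print(fun2_files(x))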

You may also want to use a parallel computing framework (such as Polars, Dask, or Ray), depending on your workload.
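For example, a rough sketch with Dask (assuming dask is installed; dask.bag partitions the sequence and schedules the partial sums for you):

import random
import dask.bag as db

if __name__ == "__main__":
    x = [random.random() for _ in range(10 ** 6)]
    # the bag is split into partitions; the sum is computed per
    # partition and then combined
    total = db.from_sequence(x, npartitions=4).sum().compute()
    print(total)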


When multiprocessing, your list has to be serialised. This is achieved using pickle. When the length of your list is 10**7, the in-memory size of the list is 89,095,160 bytes. The serialised object is slightly larger, at 90,032,371 bytes. That 90 MB object is written to disk (in some temporary location) and then consumed prior to execution of your function. That is a relatively slow process.

You can improve matters slightly by not serialising to disk. Instead, you can use shared memory, although that still won't be as fast as the linear execution, i.e., when fun1() is called directly.

Here's a variation of your code that demonstrates the 3 techniques, prints the object sizes and outputs the various timings:

import multiprocessing
import time
import random
from multiprocessing.shared_memory import SharedMemory
import pickle
import sys


def fun1(x):
    return sum(x)


def fun2(x):
    with multiprocessing.Pool() as p:
        data = [x[: len(x) // 2], x[len(x) // 2 :]]
        return sum(p.map(fun1, data))


def fun3(name):
    # attach to an existing shared-memory block by name and
    # unpickle the list straight from its buffer
    shm = SharedMemory(name)
    try:
        x = pickle.loads(shm.buf)
        return fun1(x)
    finally:
        shm.close()


if __name__ == "__main__":
    x = [random.random() for i in range(10**7)]
    size = sys.getsizeof(x)
    print(f"In-memory {size=:,}")
    try:
        smb = pickle.dumps(x)
        size = len(smb)
        print(f"Pickled object {size=:,}")
        shm = SharedMemory(create=True, size=size)
        shm.buf[:size] = smb
        for func in fun1, fun2:
            st = time.perf_counter()
            ans = func(x)
            et = time.perf_counter()
            print(func.__name__, f"time = {et - st}, answer = {ans}.")
        st = time.perf_counter()
        ans = fun3(shm.name)
        et = time.perf_counter()
        print(fun3.__name__, f"time = {et - st}, answer = {ans}.")
    finally:
        shm.close()
        shm.unlink()

Output:

In-memory size=89,095,160
Pickled object size=90,032,371
fun1 time = 0.21793299994897097, answer = 4999444.726535229.
fun2 time = 0.6652624170528725, answer = 4999444.726535082.
fun3 time = 0.4321752089308575, answer = 4999444.726535229.

Note:

Your results may differ due to platform and Python version differences.

Here the platform is:

Python 3.13.0
macOS 15.1.1
Apple M2

Comments

"That 90MB object is written to disk" I am not convinced by this. AFAIK this is not true and done in-memory. Do you have an official source confirming this claim? (note that on systems like Linux or Mac a file might be used for IPC, but data are not physically stored on any storage devices; the kind of IPC method used can impact performance independently of the actual physical kind of storage used)
@JérômeRichard I had written a separate test (which I'm happy to share with you) where I used implicit pickling versus explicit use of shared memory to pass data to a subprocess. The latter test proved to be significantly faster than the former. I deduced that this was probably due to disk I/O. That may be a flawed assertion. If I am incorrect then there's something seriously wrong with Python's pickling strategy / implementation.
The timings does not prove data are written to the disk. In fact, I profiled this on my Linux and for 10 run of fun2 and fun3, I measured a SSD throughput of <0.05 MiB/s (certainly due to modules being loaded). This prove that, on Linux, with multiprocessing, nearly nothing is read/store from/to the storage device!
Actually, I think your benchmark is biased : multiprocessing must pickle+unpickle data and your shared-memory benchmark only measure the time to unpickle it but not the time to pickle/dump data. On my Linux machine, fun2 takes 0.60 sec, fun3 takes 0.32 sec, pickle.dumps takes 0.13 sec, and the shared-memory copy takes 0.04 sec. In the end, the complete shared-memory timing is 0.32+0.13+0.04=0.49 sec (as opposed to 0.60. This is still faster but not so much : only 1.22 times faster.
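For reference, a minimal sketch of that decomposition (assuming the same 10**7-element list as in the answer above):

import pickle
import random
import time
from multiprocessing.shared_memory import SharedMemory

if __name__ == "__main__":
    x = [random.random() for _ in range(10 ** 7)]

    # charge the shared-memory approach for its full cost:
    # pickling first, then copying the bytes into the shared block
    st = time.perf_counter()
    smb = pickle.dumps(x)
    t_pickle = time.perf_counter() - st

    shm = SharedMemory(create=True, size=len(smb))
    st = time.perf_counter()
    shm.buf[: len(smb)] = smb
    t_copy = time.perf_counter() - st

    print(f"pickle.dumps: {t_pickle:.3f}s, copy into shm: {t_copy:.3f}s")
    shm.close()
    shm.unlink()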
