Why is the following simple parallelized code much slower than a simple loop in Python?

Question

A simple program which calculates square of numbers and stores the results:

    import time
    from joblib import Parallel, delayed
    import multiprocessing

    array1 = [ 0 for i in range(100000) ]

    def myfun(i):
        return i**2

    #### Simple loop ####
    start_time = time.time()

    for i in range(100000):
        array1[i]=i**2

    print( "Time for simple loop         --- %s seconds ---" % (  time.time()
                                                               - start_time
                                                                 )
            )
    #### Parallelized loop ####
    start_time = time.time()
    results = Parallel( n_jobs  = -1,
                        verbose =  0,
                        backend = "threading"
                        )(
                        map( delayed( myfun ),
                             range( 100000 )
                             )
                        )
    print( "Time for parallelized method --- %s seconds ---" % (  time.time()
                                                               - start_time
                                                                 )
            )

    #### Output ####
    # >>> ( executing file "Test_vr20.py" )
    # Time for simple loop         --- 0.015599966049194336 seconds ---
    # Time for parallelized method --- 7.763299942016602 seconds ---

Could it be the difference in array handling for the two options? My actual program would have something more complicated but this is the kind of calculation that I need to parallelize, as simply as possible, but not with such results.

System Model: HP ProBook 640 G2, Windows 7,
              IDLE for Python System Type: x64-based PC Processor:
              Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz,
              2401 MHz,
              2 Core(s),
              4 Logical Processor(s)

It may be a fair comparison if you compared map(myfun, range(100000)) with the parallel code. Oh and btw, i ^ 2 is i XOR 2, if you want to square i you need i ** 2. — knbk
– knbk, Commented Oct 13, 2017 at 10:07
Thanks; corrected the ^. How do I otherwise compare the non parallel simple execution with the quickest possible parallel computation? — DS R
– DS R, Commented Oct 13, 2017 at 10:14
Function calls in python have a significant overhead, just compare your first loop without a function call to map(myfun, range(100000)), without any parallelism. If you're not using that function in the non-parallel code, the function call overhead will highly skew the results. Either way, multiprocessing is a trade-off, and you won't know if it's a good trade-off until you've tried and measured it with your actual program. — knbk
– knbk, Commented Oct 13, 2017 at 10:24
That info about function calls was useful and I'm trying things with the map(). Actually I was trying to get this sorted first because this seems like a low hanging fruit (compared to the much more involved calculation I need) but it may not be so I believe now. I thought there should be some facility where I can assign each processor core to some parts of the for loop. Say core 0 for index 0 to 3, core 1 for 4 to 7 etc. — DS R
– DS R, Commented Oct 13, 2017 at 10:29

halfer · Accepted Answer · 2024-04-28 22:09:31Z

6

Why? Because trying to use tools in cases,
where tools principally cannot and DO NOT adjust the costs of entry:

I love Python.
I pray educators better explain the costs of tools, otherwise we get lost in these wish-to-get [PARALLEL]-schedules.

A few facts:

No.0: With a lot of simplification, python intentionally uses GIL to [SERIAL]-ise access to variables and thus avoiding any potential collision from [CONCURRENT] modifications - paying these add-on costs of GIL-stepped dancing in extra time
No.1: [PARALLEL]-code execution is way harder than a "just"-[CONCURRENT] ( read more )
No.2: [SERIAL]-process has to pay extra costs, if trying to split work onto [CONCURRENT]-workers No.3: If a process does inter-worker communication, immense extra costs per data exchange are paid No.4: If hardware has few resources for [CONCURRENT] processes, results get way worse further

To have some smell of what can be done in standard python 2.7.13:

Efficiency is in better using silicon, not in bulldozing syntax-constructors into territories, where they are legal, but their performance has adverse effects on the experiment-under-test end-to-end speed:

You pay about 8 ~ 11 [ms] just to iteratively assemble an empty array1

>>> from zmq import Stopwatch
>>> aClk = Stopwatch()
>>> aClk.start();array1 = [ 0 for i in xrange( 100000 ) ];aClk.stop()
 9751L
10146L
10625L
 9942L
10346L
 9359L
10473L
 9171L
 8328L

_{( the Stopwatch().stop() method yields [us] from .start() )}
while, the memory-efficient, vectorisable, GIL-free approach can do the same about +230x ~ +450x faster:

>>> import numpy as np
>>>
>>> aClk.start();arrayNP = np.zeros( 100000 );aClk.stop()
   15L
   22L
   21L
   23L
   19L
   22L

>>> aClk.start();arrayNP = np.zeros( 100000, dtype = np.int );aClk.stop()
   43L
   47L
   42L
   44L
   47L

So, using the proper tools just starts the story of performance:

>>> def test_SERIAL_python( nLOOPs = 100000 ):
...     aClk.start()
...     for i in xrange( nLOOPs ):           # py3 range() ~ xrange() in py27 
...         array1[i] = i**2                 # your loop-code
...     _ = aClk.stop()
...     return _

While a naive [SERIAL]-iterative implementation works, you pay immense costs for opting to do so ~ 70 [ms] for a 100000-D vector:

>>> test_SERIAL_python( nLOOPs = 100000 )
 70318L
 69211L
 77825L
 70943L
 74834L
 73079L

Using a more suitable / appropriate tool costs just ~ 0.2 [ms]
i.e. ++350x FASTER

>>> aClk.start();arrayNP[:] = arrayNP[:]**2;aClk.stop()
189L
171L
173L
187L
183L
188L
193L

and with another glitch, a.k.a. an inplace modus-operandi:

>>> aClk.start();arrayNP[:] *=arrayNP[:];aClk.stop()
138L
139L
136L
137L
136L
136L
137L

Yields ~ +514x SPEEDUP, just from using appropriate tool

The art of performance is not in following marketing-sounding claims
about parallellizing-( at-any-cost ), but in using know-how based methods, that pay least costs for biggest speedups achievable.

For "small"-problems, typical costs of distributing "thin"-work-packages are indeed hard to get covered by any potentially achievable speed-ups, so "problem-size" actually limits one's choice of methods, that could reach positive gain ( speed-ups of 0.9 or even << 1.0 are so often reported here, on Stack Overflow, that you need not feel lost or alone in this sort of surprise ).

Epilogue

Processor number counts.
Core number counts.
But cache-sizes + NUMA-irregularities count more than that.
Smart, vectorised, HPC-cured, GIL-free libraries matter
( numpy et al - thanks a lot Travis OLIPHANT & al ... great salute to his team ... )

As an overhead-strict Amdahl Law (re-)-formulation explains, why even many-N-CPU parallelised code execution may ( and indeed often does ) suffer from speed-ups << 1.0

Overhead-strict formulation of the Amdahl's Law speed-up S includes the very costs of the paid [PAR]-Setup + [PAR]-Terminate Overheads, explicitly:

               1
S =  __________________________; where s, ( 1 - s ), N were defined above
                ( 1 - s )            pSO:= [PAR]-Setup-Overhead     add-on
     s  + pSO + _________ + pTO      pTO:= [PAR]-Terminate-Overhead add-on
                    N

( an interactive animated tool for 2D visualising effects of these performance constraints is cited here )

edited Apr 28, 2024 at 22:09

halfer

20.2k20 gold badges110 silver badges207 bronze badges

answered Oct 13, 2017 at 11:50

user3666197

1

Sign up to request clarification or add additional context in comments.

5 Comments

DS R Over a year ago

Thank you so much for such a rich and "wholesome" answer. I'll take some time to digest it!

user3666197 Over a year ago

Glad you feel attracted by the argumentation. Feel free to experiment with the animation interactive sliders to see how soon the realistic overheads kill the positive effect from indeed any amount of N-CPU-s. That is the most important part of the story on performance.

DS R Over a year ago

In coarse terms, as I understand, adding n cores gives you only n times increase (n is usually < 10 for most users), while as you showed, algorithm and data structure improvements can give 1000x or more with some effort, to add to it, multi processing has work division and combination overhead.

user3666197 Over a year ago

Let me follow the logic -- having N-workers may help to split, but only the [PARALLEL]-fraction of the whole process-task S, leaving the [SERIAL]-fraction last the same amount of time, because it is principally [SERIAL] and any amount of those N-workers simply have to wait, until the [SERIAL]-part is finished. Plus, you always have to pay the costs -- here the set of all initial and terminal overheads -- be it processing-related, memory-management related, workpackage-distribution related or whatever other part of the nature call for being paid in time for a [PARALLEL]-fraction to run.

user3666197 Over a year ago

So, if the theoretically ideal-split of the [PARALLEL]-fraction over N-workers makes such [PARALLEL]-part of the process duration from say 5 seconds down to a 1 second ( which 5:1 theoretically would look like a brave step ) but if you have to pay ( for achieving this type of processing arrangement ) a cost of +4 seconds, you do not get any speedup here. Similarly, if you happen to accelerate ( be it using smarter tools, more efficient data-layouts, etc ), the [SERIAL]-part not to last 5 but just 4 seconds ( without any additional overhead ), you have net 10% speedup straight.

Martin Bonner supports Monica · Accepted Answer · 2017-10-13 13:13:43Z

4

From the documentation of threading:

If you know that the function you are calling is based on a compiled extension that releases the Python Global Interpreter Lock (GIL) during most of its computation ...

The problem is that in the this case, you don't know that. Python itself will only allow one thread to run at once (the python interpreter locks the GIL every time it executes a python operation).

threading is only going to be useful if myfun() spends most of its time in a compiled Python extension, and that extension releases the GIL.

The Parallel code is so embarrassingly slow because you are doing a huge amount of work to create multiple threads - and then you only execute one thread at a time anyway.

If you use the multiprocessing backend, then you have to copy the input data into each of four or eight processes (one per core), do the processing in each processes, and then copy the output data back. The copying is going to be slow, but if the processing is a little bit more complex than just calculating a square, it might be worth it. Measure and see.

edited Oct 13, 2017 at 13:13

answered Oct 13, 2017 at 10:03

Martin Bonner supports Monica

29.1k3 gold badges56 silver badges91 bronze badges

6 Comments

DS R Over a year ago

Thanks for the answer. Does this mean that to speed up such calculations, I need to do something at the command line? Or can I still speed up such calculations from within my code?

user3012759 Over a year ago

That means that to speed up such calculation you need to use 'not python'... threading does not speed up things like that in python, and multiprocessing is expensive as hell and communication is not as easy as with threading, so you either bite the bullet and run single-thread or find an alternative language/python implementation.

Ramast Over a year ago

u didn't explain your use case. are you really trying to run a threat just to calculate an element's square root?

Martin Bonner supports Monica Over a year ago

@Ramast : No of course not. The question actually says " My actual program would have something more complicated", but being a good question writer, he has snipped out the irrelevant detail and replaced it with something simple.

user3666197 Over a year ago

@Ramast it would be rather fair to pre-calculate, what is the minimum amount of processing that ought happen "inside" the fatFUN(), to at least "adjust / cover" both the { [PAR]-Processing-Setup- | [PAR]-Processing-Terminate- }-overheads. Having this quantitatively supported coefficient, your claim could be better supported, that starting with a just "if myfun() is doing a lot of processing", couldn't it? This is where any serious HPC-design practice just has to start.

|

Collectives™ on Stack Overflow

Why is the following simple parallelized code much slower than a simple loop in Python?

2 Answers 2

Why? Because trying to use tools in cases,
where tools principally cannot and DO NOT adjust the costs of entry:

A few facts:

To have some smell of what can be done in standard python 2.7.13:

So, using the proper tools just starts the story of performance:

Using a more suitable / appropriate tool costs just ~ 0.2 [ms]
i.e. ++350x FASTER

Yields ~ +514x SPEEDUP, just from using appropriate tool

Epilogue

5 Comments

6 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Why? Because trying to use tools in cases,where tools principally cannot and DO NOT adjust the costs of entry:

A few facts:

To have some smell of what can be done in standard python 2.7.13:

So, using the proper tools just starts the story of performance:

Using a more suitable / appropriate tool costs just ~ 0.2 [ms] i.e. ++350x FASTER

Yields ~ +514x SPEEDUP, just from using appropriate tool

Epilogue

5 Comments

6 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related

Why? Because trying to use tools in cases,
where tools principally cannot and DO NOT adjust the costs of entry:

Using a more suitable / appropriate tool costs just ~ 0.2 [ms]
i.e. ++350x FASTER