
I am trying to implement OpenMP, but like so many posters before me, the result has simply been to slow the code down. Inspired by previous answers, I went from using #pragma omp parallel for to #pragma omp task, in the hope of avoiding some overhead. Unfortunately, the parallelized code is still twice as slow as the serial version. Other answers suggest that the proper approach depends on the specific demands of the code, which is why I thought I would have to ask a question myself.

First the pseudo-code:

#pragma omp parallel
{
#pragma omp master
while (will be run some hundreds of millions of times)
{
    for (between 5 and 20 iterations)
    {
        #pragma omp task
        (something)
    }
    #pragma omp taskwait // it is important that all the above tasks are completed before going on

    (something)

    if (something)
    {
        (something)

        for (between 50 and 200 iterations)
        {
            #pragma omp task 
            (something)
        }
        #pragma omp taskwait

        (something)
    }

}
}

Only the two for-loops can be parallelized; the rest must be done in the right order. I came up with putting the parallel and master directives outside the while-loop in an attempt to reduce the overhead of creating the thread team.

I am also a bit curious whether I am using taskwait properly - the specification states that the "parent task" is put on hold until all child tasks have been executed, but it is not quite clear whether that terminology also applies here, where the task regions are not nested.

Can anyone come up with a better way of using OpenMP, such that I may actually get a speed-up?

EDIT: each step in the while-loop depends on all previous steps, so they have to be done serially, with an update at the end. It is an implementation of an event-driven algorithm for simulating neural networks, in case anyone was wondering.

  • How long does each iteration of the for loop take? If the tasks are too small, it is quite possible that it is simply not possible to get a speedup here. Besides, why would #pragma omp task be faster than #pragma omp for? After all, the latter should be able to get away with much less management overhead. To me it seems that if task is faster, you probably used the wrong scheduling mode for your situation. About taskwait: as I understand it, the master section should be your parent task (or maybe the parallel section, but that seems unlikely) Commented Feb 19, 2012 at 19:43
  • I got the idea that tasks would be faster because an answer to an old question said something along the lines of "if there are too few iterations in the for-loop, you are better off using tasks instead". In the serial case, it is possible to go through 10000 iterations of the while loop in 1.7 seconds. Considering the other settings, a ball-park estimate would be 1.0-0.5 microseconds for each iteration of the second for-loop. I know it's short, but was told that I was underestimating the power of parallelization, and decided to give it a shot :) Commented Feb 19, 2012 at 19:54
  • It really sounds like you need to think about either a new algorithm, or a new parallel processing paradigm, or maybe even both. Commented Feb 19, 2012 at 20:07
  • The reason tasks might be faster for few iterations is better load balancing if the execution times of the iterations vary wildly. However, you should be able to get that effect using dynamic scheduling too. 1.0 µs/task does seem a bit low for parallelization to have a positive effect. I would expect the overhead for tasks to be in the ballpark of a few thousand clocks, so around a microsecond. After all, atomics and moving things between caches aren't that cheap. Commented Feb 19, 2012 at 20:17
  • @talonmies a different parallel processing paradigm, what would that be? Commented Feb 19, 2012 at 20:43

3 Answers


For parallel programming, you should design your problem so that you rarely need to sync your threads. Every sync forces all threads to wait for the slowest one, so you get the worst performance of the whole team. If you need to sync your threads often, try to redesign your problem to avoid these syncs.

Tweaking your code from #pragma omp parallel for to #pragma omp task won't get you any significant improvement, as the difference in their execution time is normally negligible. Before tweaking routine calls or OMP statements, you need to adapt your problem to parallel execution. You really need to think in "parallel" to get a good and scalable performance increase; just adapting serial code rarely works.

In your code you should try to parallelize the while loop, not the inner for loops. If you parallelize the small for loops, you will not get any significant performance increase.


2 Comments

I am afraid that is completely impossible. I realize it is not apparent in the above, but each iteration of the while loop depends on the previous ones, so they have to be done sequentially.
Then try to redesign your problem to get independent loops. Otherwise you will waste your time tweaking code that does not perform well in parallel.

I'm not sure if a task is the right way to go here. I'm not too familiar with tasks, but it seems like a thread is started each time you encounter a #pragma omp task. I would rather try something like:

bool flag = false;   // declared before the parallel region so it is shared and in scope everywhere below

while (will be run some hundreds of millions of times)
{
    #pragma omp parallel
    {
        for (between 5 and 20 iterations)
        {
            (something)
        }

        #pragma omp single/master
        {
            (something)
            flag = false;
            if (something)
            {
                (something)
                flag = true;
            }
        }

        if (flag)
        {
            for (between 50 and 200 iterations)
            {
                (something)
            }
        }

        #pragma omp single/master
        {
            (something)
        }
    }
}

It's also important to remember that the tasks in the for loops might be too small for parallel execution to give any speedup, as there is an overhead in starting and syncing threads. You should also look at the possibility of rewriting your program so you don't need to synchronize your threads, which you currently do quite a lot. My guess is that your algorithm and workload are actually too small for parallel execution to give any speedup as written right now.

1 Comment

That would do each iteration n times (with n being the number of threads), so there is really no benefit. Using tasks will not start a new thread; instead, it will use threads of the current team which are currently on hold (not exactly true, but close enough).

Did you remember to set your environment variables accordingly? OMP_NUM_THREADS=N, where N is the number of threads or cores supported by your processor.

3 Comments

I have not touched the environment variables at all, but let "#pragma omp parallel" figure it out by itself. My various versions of "hello world" have shown the correct number of outputs (2).
What do you mean? The above? Yes, if I don't supply "-fopenmp" to g++, the code will run serially; that is how I can tell that the parallel version is too slow. Did I understand your question correctly?
Sorry, I meant to ask if you profiled it with anything. From the looks of the situation, it sounds like either your processor (less likely) or some race condition or sync issue in your code is making the OpenMP version slower.
