
I am trying to implement OpenMP, but like so many posters before me, the result has simply been to slow the code down. Inspired by previous answers, I went from using #pragma omp parallel for to #pragma omp task, in the hope of avoiding some overhead. Unfortunately, the parallelized code is still twice as slow as the serial version. Other answers suggest that the proper approach depends on the specific demands of the code, which is why I thought I would have to ask a question myself.

First the pseudo-code:

#pragma omp parallel
{
#pragma omp master
while (will be run some hundreds of millions of times)
{
    for (between 5 and 20 iterations)
    {
        #pragma omp task
        (something)
    }
    #pragma omp taskwait // it is important that all the above tasks are completed before going on

    (something)

    if (something)
    {
        (something)

        for (between 50 and 200 iterations)
        {
            #pragma omp task 
            (something)
        }
        #pragma omp taskwait

        (something)
    }

}
}

Only the two for-loops can be parallelized; the rest must be done in the right order. I came up with putting the parallel and master directives outside the while-loop in an attempt to reduce the overhead of creating the thread team.

I am also a bit curious whether I am using taskwait properly - the specification states that the "parent task" is put on hold until all child tasks have been executed, but it is not quite clear whether that terminology also applies here, where the task regions are not nested.

Can anyone come up with a better way of using OpenMP, such that I may actually get a speed-up?

EDIT: each step in the while-loop depends on all previous steps, so they have to be done serially, with an update at the end. It is an implementation of an event-driven algorithm for simulating neural networks, in case anyone was wondering.

  • How long does each iteration of the for loop take? If the tasks are too small, it is quite possible that it is simply not possible to get a speedup here. Besides, why would #pragma omp task be faster than #pragma omp for? After all, the latter should be able to get away with much less management overhead. To me it seems that if task is faster, you probably used the wrong scheduling mode for your situation. About taskwait: as I understand it, the master section should be your parent task (or maybe the parallel section, but that seems unlikely) Commented Feb 19, 2012 at 19:43
  • I got the idea that tasks would be faster because an answer to an old question said something along the lines of "if there are too few iterations in the for-loop, you are better off using tasks instead". In the serial case, it is possible to go through 10000 iterations of the while loop in 1.7 seconds. Considering the other settings, a ball-park estimate would be 1.0-0.5 microseconds for each iteration of the second for-loop. I know it's short, but was told that I was underestimating the power of parallelization, and decided to give it a shot :) Commented Feb 19, 2012 at 19:54
  • It really sounds like you need to think about either a new algorithm, or a new parallel processing paradigm, or maybe even both. Commented Feb 19, 2012 at 20:07
  • The reason tasks might be faster for few iterations is better load balancing if the execution times of the iterations vary wildly. However, you should be able to get that effect using dynamic scheduling too. 1.0 µs/task does seem a bit low for parallelization to have a positive effect. I would expect the overhead for tasks to be in the ballpark of a few thousand clocks, so around a microsecond. After all, atomics and moving things between caches aren't that cheap. Commented Feb 19, 2012 at 20:17
  • @talonmies a different parallel processing paradigm, what would that be? Commented Feb 19, 2012 at 20:43

3 Answers


For parallel programming, you should design your problem so that you rarely need to sync your threads. Every sync forces all threads to wait for the slowest one, so you get the worst performance of the whole team. If you need to sync your threads often, try to redesign your problem to avoid these syncs.

Tweaking your code from #pragma omp parallel for to #pragma omp task won't get you any significant improvement, as the difference in their execution time is normally negligible. Before tweaking routine calls or OMP statements, you need to adapt your problem to parallel execution. You really need to think in "parallel" to get a good and scalable performance increase; just adapting serial code rarely works.

In your code you should try to parallelize the while loop, not the inner for loops. If you parallelize the small for loops, you will not get any significant performance increase.


2 Comments

I am afraid that is completely impossible. I realize it is not apparent in the above, but each iteration of the while loop depends on the previous ones, so they have to be done sequentially.
Then try to redesign your problem to get independent loops. Otherwise you will waste your time tweaking code that does not perform well in parallel.

I'm not sure if a task is the right way to go here. I'm not too familiar with tasks, but it seems like a thread is started each time you encounter a #pragma omp task. I would rather try something like:

bool flag = false;   // declared before the parallel region so it is shared and in scope everywhere below

while (will be run some hundreds of millions of times)
{
    #pragma omp parallel
    {
        for (between 5 and 20 iterations)
        {
            (something)
        }

        #pragma omp single/master
        {
            (something)
            flag = false;
            if (something)
            {
                (something)
                flag = true;
            }
        }

        if (flag)
        {
            for (between 50 and 200 iterations)
            {
                (something)
            }
        }

        #pragma omp single/master
        {
            (something)
        }
    }
}

It's also important to remember that the tasks in the for loops might be too small for parallel execution to give any speedup, as there is an overhead in starting and syncing threads. You should also look at the possibility of rewriting your program so you don't need to synchronize your threads, which you currently do quite a lot. My guess is that your algorithm and workload are actually too small for parallel execution to give any speedup as written right now.

1 Comment

That would do each iteration n times (with n being the number of threads), so there is really no benefit. Using tasks will not start a new thread; instead, it will use threads of the current team which are currently on hold (not exactly true, but close enough).

Did you remember to set your environment variables accordingly? OMP_NUM_THREADS=N, where N is the number of threads or cores supported by your processor.

3 Comments

I have not touched the environment variables at all, but let "#pragma omp parallel" figure it out by itself. My various versions of "hello world" have shown the correct number of outputs (2).
What do you mean? The above? Yes, if I don't supply "-fopenmp" to g++, the code will run serially; that is how I can tell that the parallel version is too slow. Did I understand your question correctly?
Sorry, I meant to ask if you profiled it with anything. From the looks of the situation, it sounds like either your processor (less likely) or some race condition or sync issue in your code is making the OpenMP version slower.
