
I am using OpenMP to parallelize loops. In the normal case, one would use:

#pragma omp for schedule(static, N_CHUNK)
for(int i = 0; i < N; i++) {
    // ...
}

For nested loops, I can put the pragma on either the inner or the outer loop:

#pragma omp for schedule(static, N_CHUNK) // can be here...
for(int i = 0; i < N; i++) {
#pragma omp for schedule(static, N_CHUNK) // or here...
    for(int k = 0; k < N; k++) {
    // both loops have a constant number of iterations
    // ...
    }
}

But! I have two loops where the number of iterations in the 2nd loop depends on the 1st loop:

for(int i = 0; i < N; i++) {
    for(int k = i; k < N; k++) {
    // k starts from i, not from 0...
    }
}

What is the best way to balance CPU usage for this kind of loop?


2 Answers


As always:

The things that are going to make the difference here are not being shown:

  • (non)linear memory addressing (also watch the order of the loops);
  • use of shared variables.

As to your last scenario:

for(int i = 0; i < N; i++) {
    for(int k = i; k < N; k++) {
    // k starts from i, not from 0...
    }
}

I suggest parallelizing the outer loop for the following reasons:

  • all other things being equal, coarse-grained parallelization usually leads to better performance due to

    • increased cache locality
    • reduced frequency of locking required (note that this hinges on assumptions about the loop contents that I can't really make; I'm basing it on my experience of /usual/ parallelized code)
  • the inner loop might become so short as to be inefficient to parallelize (IOW: the outer loop's range is predictable, the inner loop less so, or doesn't lend itself to static scheduling as well)

  • nested parallelism rarely scales well

2 Comments

I just learned about collapse and applied it. Do you know if it scales well? "coarse grained parallelizing usually leads to better performance" — does schedule(..., CHUNK) solve the issue of parallelization granularity?
@JakubM.: scheduling is a tool to prevent fragmented thread tasks. However, it alone is not enough to prevent costly threading overhead. My general rule of thumb is to not use nested OMP parallel sections. After all, there are only so many cores anyway. So instead of thinking 'could this be parallel' at each loop, I look at my program's high-level functionality and think 'what big tasks can I parallelize?' I guess this implies I only apply OMP on CPU-crunching tasks (true). There might be more subtle judgements to make when doing e.g. an IO handler with many short but parallelizable tasks.

sehe's points -- especially "it depends" and "profile" -- are very much to the point.

Normally, though, you wouldn't want nested parallel loops as long as the outer loop is big enough to keep all cores busy. The added overhead of another parallel section inside a loop probably costs more than the benefit from the additional small pieces of work.

The usual way to tackle this is just to schedule the outer loop dynamically, so that the fact that each loop iteration takes a different length of time doesn't cause load-balancing issues (as the i==N-1 iteration completes almost immediately while the i==0 iteration takes forever):

#pragma omp parallel for default(none) shared(N) schedule(dynamic)
for(int i = 0; i < N; i++) {
    for(int k = i; k < N; k++) {
    // k starts from i, not from 0...
    }
}

The collapse pragma is very useful for essentially getting rid of the nesting and is particularly valuable if the outer loop is small (eg, N < num_threads):

#pragma omp parallel for default(none) shared(N) collapse(2)
for(int i = 0; i < N; i++) {
    for(int k = 0; k < N; k++) {

    }
}

This way the two loops are folded into one and there is less chunking, which means less overhead. But that won't work in this case, because the loop ranges aren't fixed; you can't collapse a loop whose bounds change (eg, with i).

2 Comments

@J.D.: "won't work in this case, because the loop ranges aren't fixed" — I overcame that by fixing the ranges and checking if (i > k) continue; at the beginning of the loop body. A dirty trick, and maybe not the best solution in this case.
Rather than the "dynamic" option you can keep the preferred "static" and specify a small chunk size: if you specify a chunk size, the iterations will be divided into iter_size / chunk_size chunks, and the chunks will be assigned to the threads in round-robin order. So in your case, if you have 4 threads, N is 100 and you give a chunk size of 5, thread 0 will get iterations 0-4, 20-24, 40-44, etc.
