
I am using OpenMP to parallelize loops. In the normal case, one would use:

#pragma omp for schedule(static, N_CHUNK)
for(int i = 0; i < N; i++) {
    // ...
}

For nested loops, I can put the pragma on either the inner or the outer loop:

#pragma omp for schedule(static, N_CHUNK) // can be here...
for(int i = 0; i < N; i++) {
#pragma omp for schedule(static, N_CHUNK) // or here...
    for(int k = 0; k < N; k++) {
    // both loops have a constant number of iterations
    // ...
    }
}

But! I have two loops where the number of iterations in the 2nd loop depends on the 1st loop:

for(int i = 0; i < N; i++) {
    for(int k = i; k < N; k++) {
    // k starts from i, not from 0...
    }
}

What is the best way to balance CPU usage for this kind of loop?


2 Answers


As always:

The things that are going to make the difference here are not being shown:

  • (non)linear memory addressing (also watch the order of the loops);
  • use of shared variables.

As to your last scenario:

for(int i = 0; i < N; i++) {
    for(int k = i; k < N; k++) {
    // k starts from i, not from 0...
    }
}

I suggest parallelizing the outer loop for the following reasons:

  • all other things being equal, coarse-grained parallelization usually leads to better performance due to

    • increased cache locality
    • reduced frequency of locking required (note that this hinges on assumptions about the loop contents that I can't really make; I'm basing it on my experience of /usual/ parallelized code)
  • the inner loop might become so short as to be inefficient to parallelize (IOW: the outer loop's range is predictable, the inner loop less so, or doesn't lend itself to static scheduling as well)

  • nested parallelism rarely scales well

2 Comments

I just learned about collapse and applied it. Do you know if it scales well? "coarse grained parallelizing usually leads to better performance" — does schedule(..., CHUNK) solve the issue of parallelization granularity?
@JakubM.: scheduling is a tool to prevent fragmented thread tasks. However, it alone is not enough to prevent costly threading overhead. My general rule of thumb is to not use nested OMP parallel sections. After all, there are only so many cores anyway. So instead of thinking 'could this be parallel' at each loop, I look at my program's high-level functionality and think 'what big tasks can I parallelize?' I guess this implies I only apply OMP on CPU-crunching tasks (true). There might be more subtle judgements to make when doing e.g. an IO handler with many short but parallelizable tasks.

sehe's points -- especially "it depends" and "profile" -- are very much to the point.

Normally, though, you wouldn't want nested parallel loops as long as the outer loop is big enough to keep all cores busy. The added overhead of another parallel section inside a loop probably costs more than the benefit from the additional small pieces of work.

The usual way to tackle this is just to schedule the outer loop dynamically, so that the fact that each loop iteration takes a different length of time doesn't cause load-balancing issues (as the i==N-1 iteration completes almost immediately while the i==0 iteration takes forever):

#pragma omp parallel for default(none) shared(N) schedule(dynamic)
for(int i = 0; i < N; i++) {
    for(int k = i; k < N; k++) {
    // k starts from i, not from 0...
    }
}

The collapse pragma is very useful for essentially getting rid of the nesting and is particularly valuable if the outer loop is small (eg, N < num_threads):

#pragma omp parallel for default(none) shared(N) collapse(2)
for(int i = 0; i < N; i++) {
    for(int k = 0; k < N; k++) {

    }
}

This way the two loops are folded into one and there is less chunking, which means less overhead. But that won't work in this case, because the loop ranges aren't fixed; you can't collapse a loop whose bounds change (eg, with i).

2 Comments

@J.D.: "won't work in this case, because the loop ranges aren't fixed" — I overcame that by fixing the ranges and checking if (i > k) continue; at the beginning of the loop body. A dirty trick, and maybe not the best solution in this case.
Rather than the "dynamic" option you can keep the preferred "static" and specify a small chunk size: if you specify a chunk size, the iterations will be divided into iter_size / chunk_size chunks, and the chunks will be assigned to the threads in round-robin order. So in your case, if you have 4 threads, N is 100 and you give a chunk size of 5, thread 0 will get iterations 0-4, 20-24, 40-44, etc.
