
I am using a library that is already parallelized with OpenMP. The issue is that 2-4 cores seem enough for the processing it is doing. Using more than 4 cores makes little difference.

My code is like this:

for (size_t i=0; i<4; ++i)
    Call_To_Library (i, ...);

Since 4 cores seem enough for the library (i.e., 4 cores should be used in Call_To_Library), and I am working with a 16-core machine, I intend to also parallelize my for loop. Note that this loop consists of at most 3-4 iterations.

What would be the best approach to parallelize this outer loop? Can I also use OpenMP? Is nested parallelization considered good practice? The library I am calling already uses OpenMP and I cannot modify its code (and it wouldn't be straightforward anyway).

PS. Even if the outer loop consists only of 4 iterations, it is worth parallelizing. Each call to the library takes 4-5 seconds.

1 Answer


If there is no dependency between iterations of this loop you can do:

 #pragma omp parallel for schedule(static)
 for (size_t i=0; i<4; ++i)
    Call_To_Library (i, ...);

If, as you said, every invocation of Call_To_Library takes such a big amount of time, the overhead of nested OpenMP regions will probably be negligible.

Moreover, you say that you have no control over the number of OpenMP threads created in Call_To_Library. This solution multiplies the number of OpenMP threads by 4, and most likely you will see a 4x speedup. The inner Call_To_Library was probably parallelized in such a way that no more than a few OpenMP threads could execute at the same time; with the external parallel for you increase that number fourfold.

The risk with nested parallelism is an explosion in the number of threads created at the same time, so you could see less-than-ideal speedup because of the overhead of creating and tearing down OpenMP threads.


7 Comments

Thanks, I guess I also need to call omp_set_nested(). How can I make sure the outer for creates only 4 threads and each call to the library another 4 threads? I.e., 4x4=16 threads.
You can control the number of openmp worker threads with the env variable OMP_NUM_THREADS.
OK, I think I can use the OMP_NUM_THREADS environment variable to control the worker threads of the library. For the outer loop I can use a num_threads clause (maybe in the future I want an 8x2 thread scheme).
I added to my answer. Moreover you can use omp_get_num_threads() within the parallel for region to check how many threads are actually running.
I don't understand the second scenario. Even if the library is badly parallelized, it will still use only 4 cores. Even in this scenario I should see a 4x speedup. Note that each call to the library is processing an independent chunk of data.