
I am using a library that is already parallelized with OpenMP. The issue is that 2-4 cores seem enough for the processing it is doing. Using more than 4 cores makes little difference.

My code is like this:

for (size_t i=0; i<4; ++i)
    Call_To_Library (i, ...);

Since 4 cores seem enough for the library (i.e., 4 cores should be used in Call_To_Library), and I am working with a 16-core machine, I intend to also parallelize my for loop. Note that this loop consists of at most 3-4 iterations.

What would be the best approach to parallelize this outer loop? Can I also use OpenMP? Is nested parallelization considered good practice? The library I am calling already uses OpenMP and I cannot modify its code (and it wouldn't be straightforward anyway).

PS. Even if the outer loop consists only of 4 iterations, it is worth parallelizing. Each call to the library takes 4-5 seconds.

1 Answer


If there is no dependency between iterations of this loop you can do:

 #pragma omp parallel for schedule(static)
 for (size_t i=0; i<4; ++i)
    Call_To_Library (i, ...);

If, as you said, every invocation of Call_To_Library takes such a big amount of time, the overhead of nested OpenMP regions will probably be negligible.

Moreover, you say that you have no control over the number of OpenMP threads created in Call_To_Library. This solution multiplies the number of OpenMP threads by 4, and most likely you will see a 4x speedup. The inner Call_To_Library was probably parallelized in such a way that no more than a few OpenMP threads could execute at the same time; with the external parallel for you increase that number fourfold.

The risk with nested parallelism is an explosion in the number of threads created at the same time, so you could see less-than-ideal speedup because of the overhead of creating and tearing down OpenMP threads.


7 Comments

Thanks, I guess I also need to call omp_set_nested(). How can I make sure the outer for creates only 4 threads and each call to the library another 4 threads? I.e., 4x4=16 threads.
You can control the number of openmp worker threads with the env variable OMP_NUM_THREADS.
OK, I think I can use the OMP_NUM_THREADS environment variable to control the worker threads of the library. For the outer loop I can use a num_threads clause (maybe in the future I want an 8x2 thread scheme).
I added to my answer. Moreover you can use omp_get_num_threads() within the parallel for region to check how many threads are actually running.
I don't understand the second scenario. Even if the library is badly parallelized, it will still use only 4 cores. Even in this scenario I should see a 4x speedup. Note that each call to the library is processing an independent chunk of data.