
I have the following code which I am trying to parallelize using OpenMP.

int ncip(int dim, double R)
{
    int n, r = (int)floor(R);

    if (dim == 1) return 1 + 2*r;

    #pragma omp task shared(n, dim)
    n = ncip(dim-1, R); // last coord 0

    for (int i = 1; i <= r; ++i)
    {
        #pragma omp task shared(n, dim)
        n += 2*ncip(dim-1, sqrt(R*R - i*i)); // last coord +/- i
    }

    return n;
}

I need to apply task-based parallelism because of the recursive calls, but I'm not seeing any speedup in my computation. What am I doing wrong? Any suggestions to help speed up this calculation?

  • Think about this: you have only 8 threads, and the most computationally intensive part is sqrt(R*R - i*i), but on top of that you add the overhead induced by multithreading itself (creating/synchronizing threads). Also, putting a variable inside the shared clause doesn't automatically make it safe in terms of concurrent access. Commented May 25, 2016 at 16:52
  • @PiotrSkotnicki Can you please show me how to do that ? Commented May 25, 2016 at 16:58
  • @PiotrSkotnicki I have been stuck on this for days Commented May 25, 2016 at 16:58
  • 1
    coliru.stacked-crooked.com/a/29ae094732e2e56a , does this work ? Commented May 25, 2016 at 17:26
  • 1
    @PiotrSkotnicki Yes! this works for me, can you explain why this works ? Commented May 25, 2016 at 17:43

1 Answer 1


Parallelism isn't free, and however innocent a simple pragma like #pragma omp task may look, it comes at a significant cost: it hides the entire logic of creating and synchronizing threads, assigning and queueing tasks, and so on. Only if you find a balance between the intensity of the computation, the expense of multithreading itself, and the size of the problem (not to mention side effects of multithreading, like false sharing) will you observe a positive (>1) speedup.

Also, keep in mind that the number of threads is always limited. Once you have already created enough workload for each thread, don't try to boost your code with additional work-sharing constructs: a thread cannot magically divide itself into two separate instruction flows. That is, if your top-most loop is already parallel and has enough iterations to keep all available threads busy, you gain nothing by trying to extract nested parallelism.

Having said that, unless you can apply some other technique, such as memoizing partial results or removing the recursion altogether, just use a single top-most parallel loop with a reduction clause to ensure thread-safe accumulation into the shared variable:

#pragma omp parallel for reduction(+:n)
for (int i = 1; i <= r; ++i)
{
    n = n + (2 * ncip(dim-1, sqrt(R*R - i*i)));
}

and then a plain sequential function:

int ncip(int dim, double R)
{
    int n, r = (int)floor(R);

    if (dim == 1)
    {
        return 1 + 2*r; 
    }

    n = ncip(dim-1, R);

    for (int i = 1; i <= r; ++i)
    {   
        n = n + (2 * ncip(dim-1, sqrt(R*R - i*i)));
    }

    return n;
}

