
I have the following code which I am trying to parallelize using OpenMP.

int ncip(int dim, double R)
{
    int n, r = (int)floor(R);

    if (dim == 1) return 1 + 2*r;

    #pragma omp task shared(n, dim)
    n = ncip(dim-1, R); // last coord 0

    for (int i = 1; i <= r; ++i)
    {
        #pragma omp task shared(n, dim)
        n += 2*ncip(dim-1, sqrt(R*R - i*i)); // last coord +/- i
    }

    return n;
}

I need to apply task-based parallelism because of the recursive calls, but I'm not seeing any speedup in my computation. What am I doing wrong? Any suggestions to help speed up this calculation?

  • Think about this: you have only 8 threads, and the most computationally intensive part is sqrt(R*R - i*i), but on top of that you add the overhead induced by multithreading itself (creating/synchronizing threads). Also, putting a variable inside the shared clause doesn't automatically make it safe in terms of concurrent access. Commented May 25, 2016 at 16:52
  • @PiotrSkotnicki Can you please show me how to do that ? Commented May 25, 2016 at 16:58
  • @PiotrSkotnicki I have been stuck on this for days Commented May 25, 2016 at 16:58
  • 1
    coliru.stacked-crooked.com/a/29ae094732e2e56a , does this work ? Commented May 25, 2016 at 17:26
  • 1
    @PiotrSkotnicki Yes! this works for me, can you explain why this works ? Commented May 25, 2016 at 17:43

1 Answer 1


Parallelism isn't free, and however innocent a simple pragma like #pragma omp task may look, it comes at a significant cost: it hides the entire logic of creating and synchronizing threads, assigning and queueing tasks, and so on. Only if you find a balance between the intensity of the computation, the expense of multithreading itself, and the size of the problem (not to mention side effects of multithreading, like false sharing) will you observe a positive (>1) speedup.

Also, keep in mind that the number of threads is always limited. Once you have already created enough workload for each thread, don't try to boost your code with additional work-sharing constructs: a thread cannot magically divide itself into two separate instruction flows. That is, if your top-most loop is already parallel and has enough iterations to keep all available threads busy, you gain nothing by trying to extract nested parallelism.

Having said that, unless you can apply some other technique, such as memoizing partial results or removing the recursion altogether, just use a single top-most parallel loop with a reduction clause to ensure thread-safe accumulation into the shared variable:

#pragma omp parallel for reduction(+:n)
for (int i = 1; i <= r; ++i)
{
    n = n + (2 * ncip(dim-1, sqrt(R*R - i*i)));
}

and then a plain sequential function:

int ncip(int dim, double R)
{
    int n, r = (int)floor(R);

    if (dim == 1)
    {
        return 1 + 2*r; 
    }

    n = ncip(dim-1, R);

    for (int i = 1; i <= r; ++i)
    {   
        n = n + (2 * ncip(dim-1, sqrt(R*R - i*i)));
    }

    return n;
}

