
I have the following code that I want to parallelize using OpenMP:

for(m=0; m<r_c; m++)
{
    for(n=0; n<c_c; n++)
    {
        double value = 0.0;
        for(j=0; j<r_b; j++)
            for(k=0; k<c_b; k++)
            {
                double a;
                if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                    a = 0.0;
                else
                    a = h_a[((m-j)*c_a) + (n-k)];
                //printf("%lf\t", a);
                value += h_b[(j*c_b) + k] * a;
            }
        h_c[m*c_c + n] = value;
        //printf("%lf\t", h_c[m*c_c + n]);
    }
    //cout<<"row "<<m<<" completed"<<endl;
}

In this, I want every thread to perform the "for j" and "for k" loops simultaneously. I have tried putting #pragma omp parallel for before the "for m" loop, but I am not getting the correct result. How can I do this in an optimized manner? Thanks in advance.

  • Can you clarify what you get and what you expect? Commented Jun 4, 2015 at 5:35
  • @AviGinsburg It's the code for a 2D full matrix convolution, and I am trying to have each thread perform the 'j' and 'k' loops in parallel. On doing so I am getting the correct result, but the execution time more than doubles. I am not able to figure out the problem. Commented Jun 4, 2015 at 8:57

1 Answer


Depending on exactly which loop you want to parallelize, you have three options:

#pragma omp parallel
{
#pragma omp for    // Option #1
    for(m=0; m<r_c; m++)
    {
        for(n=0; n<c_c; n++)
        {
            double value = 0.0;
#pragma omp for    // Option #2 (use #1 or #2, not both; on its own, #2 leaves only a per-thread partial sum in value -- see the reduction in Option #3)
            for(j=0; j<r_b; j++)
                for(k=0; k<c_b; k++)
                {
                    double a;
                    if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                        a = 0.0;
                    else
                        a = h_a[((m-j)*c_a) + (n-k)];
                    //printf("%lf\t", a);
                    value += h_b[(j*c_b) + k] * a;
                }
            h_c[m*c_c + n] = value;
            //printf("%lf\t", h_c[m*c_c + n]);
        }
        //cout<<"row "<<m<<" completed"<<endl;
    }
}
//////////////////////////////////////////////////////////////////////////
// Option #3
for(m=0; m<r_c; m++)
{
    for(n=0; n<c_c; n++)
    {
        double value = 0.0;
        // reduction(+:value) combines the per-thread partial sums into value;
        // k is declared outside the loops, so it has to be made private explicitly
#pragma omp parallel for reduction(+:value) private(k)
        for(j=0; j<r_b; j++)
            for(k=0; k<c_b; k++)
            {
                double a;
                if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                    a = 0.0;
                else
                    a = h_a[((m-j)*c_a) + (n-k)];
                //printf("%lf\t", a);
                value += h_b[(j*c_b) + k] * a;
            }
        h_c[m*c_c + n] = value;
        //printf("%lf\t", h_c[m*c_c + n]);
    }
    //cout<<"row "<<m<<" completed"<<endl;
}

Test and profile each. You might find that option #1 is fastest if there isn't a lot of work for each thread, or you may find that with optimizations on, there is no difference (or even a slowdown) when enabling OMP.
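
If your compiler supports OpenMP 3.0 or later, a fourth variant worth profiling is to collapse the two outer loops, so that all r_c*c_c (m,n) pairs are divided among the threads instead of just the rows. This is only a sketch of option #1 with a collapse clause added; the loop variables are declared inside the loops here so that each thread gets its own copies:

#pragma omp parallel for collapse(2)
for(int m=0; m<r_c; m++)
{
    for(int n=0; n<c_c; n++)
    {
        double value = 0.0;
        for(int j=0; j<r_b; j++)
            for(int k=0; k<c_b; k++)
            {
                double a;
                if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                    a = 0.0;
                else
                    a = h_a[((m-j)*c_a) + (n-k)];
                value += h_b[(j*c_b) + k] * a;
            }
        // each (m,n) pair is handled by exactly one thread, so this write is safe
        h_c[m*c_c + n] = value;
    }
}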

Edit

I've adapted the MCVE supplied in the comments as follows:

#include <iostream>
#include <chrono>
#include <omp.h>
#include <algorithm>
#include <vector>

#define W_OMP
int main(int argc, char *argv[])
{
    std::vector<double> h_a(9);
    std::generate(h_a.begin(), h_a.end(), std::rand);
    int r_b = 500;
    int c_b = r_b;
    std::vector<double> h_b(r_b * c_b);
    std::generate(h_b.begin(), h_b.end(), std::rand);
    int r_c = 500;
    int c_c = r_c;
    int r_a = 3, c_a = 3;
    std::vector<double> h_c(r_c * c_c);

    auto start = std::chrono::system_clock::now();

#ifdef W_OMP
#pragma omp parallel 
    {
#endif
        int m,n,j,k;
#ifdef W_OMP
#pragma omp for 
#endif
        for(m=0; m<r_c; m++)
        {
            for(n=0; n<c_c; n++)
            {
                double value = 0.0,a;
                for(j=0; j<r_b; j++)
                {
                    for(k=0; k<c_b; k++)
                    {
                        if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a)
                            a = 0.0;
                        else a = h_a[((m-j)*c_a) + (n-k)];
                        value += h_b[(j*c_b) + k] * a;
                    }
                }
                h_c[m*c_c + n] = value;
            }
        }
#ifdef W_OMP
    }
#endif
    auto end = std::chrono::system_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << elapsed.count() << "ms"
#ifdef W_OMP
        "\t with OMP"
#else
        "\t without OMP"
#endif
        "\n";

    return 0;

}

As a reference, I'm using VS2012 (OpenMP 2.0, grrr). The collapse clause was only introduced in OpenMP 3.0, so it isn't available here. Optimizations were /O2, compiled in Release x64.
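
If you're not sure which OpenMP version your compiler gives you (OpenMP is enabled with /openmp in Visual Studio and -fopenmp in gcc/clang), the standard _OPENMP macro reports the specification date, e.g. 200203 for 2.0 and 200805 for 3.0. A quick check, just as a convenience:

#include <cstdio>

int main()
{
#ifdef _OPENMP
    // _OPENMP expands to the yyyymm date of the supported OpenMP specification
    std::printf("OpenMP spec date: %d\n", _OPENMP);
#else
    std::printf("OpenMP is not enabled\n");
#endif
    return 0;
}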

Benchmarks

Using the original loop sizes (7, 7, 5, 5), and therefore the original array sizes, the results were 0ms without OMP and 1ms with. Verdict: the compiler optimizations alone did better, and the added threading overhead wasn't worth it. The measurements are also too short to be reliable.

Using slightly larger loop sizes (100, 100, 100, 100), and therefore larger arrays, the results were about equal at roughly 108ms. Verdict: still not worth the naive effort; tweaking OMP parameters might tip the scale. Definitely not the x4 speedup I would hope for.

Using even larger loop sizes (500, 500, 500, 500), and therefore larger arrays, OMP started to pull ahead: 74.3s without OMP, 15s with. Verdict: worth it. Weird, though: that is a x5 speedup with four threads and four cores on an i5. I'm not going to try and figure out how that happened.

Summary

As has been stated in countless answers here on SO, it's not always a good idea to parallelize every for loop you come across. Things that can screw up your desired xN speedup:

  1. Not enough work per thread to justify the overhead of creating the additional threads
  2. The work itself is memory bound. This means that the CPU could be running at 1 petahertz and you still wouldn't see a speedup.
  3. Memory access patterns. I'm not going to go there. Feel free to edit in the relevant info if you want it.
  4. OMP parameters. The best choice of parameters will often be a result of this entire list (not including this item, to avoid recursion issues); see the sketch after this list.
  5. SIMD operations. Depending on what and how you're doing, the compiler may vectorize your operations. I have no idea if OMP will usurp the SIMD operations, but it is possible. Check your assembly (foreign language to me) to confirm.
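
For point 4, the usual knobs are the number of threads and the loop schedule. A minimal sketch of how you might experiment with them on this convolution (the sizes, schedule and thread count are only illustrative, and the bounds check uses >= so the h_a index stays inside a 3x3 array):

#include <cstdio>
#include <vector>
#include <omp.h>

int main()
{
    const int r_a = 3, c_a = 3, r_b = 500, c_b = 500, r_c = 500, c_c = 500;
    std::vector<double> h_a(r_a * c_a, 1.0), h_b(r_b * c_b, 1.0), h_c(r_c * c_c);

    omp_set_num_threads(4);            // same effect as a num_threads(4) clause

    double t0 = omp_get_wtime();       // wall-clock timer provided by OpenMP
#pragma omp parallel for schedule(static)    // try schedule(dynamic, 16), guided, ...
    for(int m = 0; m < r_c; m++)
        for(int n = 0; n < c_c; n++)
        {
            double value = 0.0;
            for(int j = 0; j < r_b; j++)
                for(int k = 0; k < c_b; k++)
                {
                    double a = ((m-j) < 0 || (n-k) < 0 || (m-j) >= r_a || (n-k) >= c_a)
                               ? 0.0 : h_a[(m-j)*c_a + (n-k)];
                    value += h_b[j*c_b + k] * a;
                }
            h_c[m*c_c + n] = value;
        }
    double t1 = omp_get_wtime();

    std::printf("%d threads, %.3f s\n", omp_get_max_threads(), t1 - t0);
    return 0;
}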

11 Comments

I am using the first option and specifying the number of threads as 4. I am getting the correct output, but the execution time more than doubles.
@ManishKhilnani If all the results are the same, the last paragraph holds the key. If there isn't enough work being done (or if it's a memory bound problem) then parallelizing isn't going to help, and may even hurt. There are a plethora of questions on SO about OMP hurting performance. If you make a MCVE, I'll look deeper into it.
I don't want the whole code, just the MCVE. Just a main() which loads all the data and starts the convolution. The data can be synthetic so that they don't have to be explicitly loaded.
'int main() { double h_a[9] = {1,2,3,4,5,6,7,8,9}; double h_b[25] = {1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7}; double h_c[49]; int r_c = 7,c_c = 7, r_a = 3, c_a = 3, r_b = 5,c_b = 5; int m,n,j,k; for(m=0; m<r_c; m++) { for(n=0; n<c_c; n++) { double value = 0.0,a; for(j=0; j<r_b; j++) { for(k=0; k<c_b; k++) { if((m-j)<0 || (n-k)<0 || (m-j)>r_a || (n-k)>c_a) a = 0.0; else a = h_a[((m-j)*c_a) + (n-k)]; value += h_b[(j*c_b) + k] * a; } } h_c[m*c_c + n] = value; } } }'
I am using #pragma omp parallel for num_threads(4) private(m,n,j,k) collapse(2) before the "for m" loop