
Consider the following code:

#include <iostream>
#include <chrono>
#include <vector>
#include <numeric>
#include <cmath>
#include <omp.h>

using namespace std;

typedef std::chrono::steady_clock myclock;

double measure_time(myclock::time_point begin, myclock::time_point end)
{
    return std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()/(double)1e6;
}

int main()
{
    int n = 20000;
    vector<double> v(n);
    iota(v.begin(), v.end(), 1.5);

    vector< vector<double> > D(n, vector<double>(n,0.0));

    myclock::time_point begin, end;

    begin = myclock::now();

    //#pragma omp parallel for collapse(2)
    //#pragma omp parallel for
    for(size_t i = 0; i < n - 1; i++){
        for(size_t j = i+1; j < n; j++){
            double d = sqrt(v[i]*v[i] + v[j]*v[j] + 1.5*v[i]*v[j]);
            D[i][j] = d;
            D[j][i] = d;
        }
    }

    end = myclock::now();
    double time = measure_time(begin, end);
    cout << "Time: " << time << " (s)" << endl;
    return 0;
}

For compiling:

g++ -std=c++11 -fopenmp -o main main.cpp

I obtained the following run time:

  • With #pragma omp parallel for collapse(2): 7.9425 (s)
  • With #pragma omp parallel for: 3.73262 (s)
  • Without OpenMP: 11.0935 (s)

System settings: Linux Mint 18.3 64-bit, g++ 5.4.0, quad-core processor.

I would expect the first to be faster than the second (which parallelizes only the outer loop), and both to be much faster than the third.

What did I do wrong, please? Both the first and the second ran on all 8 threads.

Thank you in advance for your help!

  • Please provide a minimal reproducible example as well as a detailed system description. It is impossible to provide a specific answer based on this code-snippet. Commented Dec 12, 2017 at 8:20
  • @Zulan Sorry for that. I have updated the question. Thanks. Commented Dec 12, 2017 at 13:28
  • Collapse does not work with loops with dependencies stackoverflow.com/questions/28482833/… Commented Dec 13, 2017 at 12:10

1 Answer


The collapse clause should not be used when the bounds of the inner loop depend on the outer loop's index. See Understanding the collapse clause in openmp.

In your case you are running over the upper triangle of a matrix (excluding the diagonal) and filling the lower triangle by symmetry. This cuts the number of iterations roughly in half. If you want to fuse/collapse the double loop you can do it by hand like this (see the end of this answer for more details).

#pragma omp parallel for
for(size_t k = 0; k < n*(n-1)/2; k++) {
    size_t i = k/n, j = k%n;
    if(j <= i) i = n - i - 2, j = n - j - 1;
    double d = sqrt(v[i]*v[i] + v[j]*v[j] + 1.5*v[i]*v[j]);
    D[i][j] = d;
    D[j][i] = d;
}

I think most people assume that collapsing a loop is going to give better performance but this is often not the case. In my experience most of the time there is no difference in performance but in some cases it's much worse due to cache issues. In a few cases it's better. You have to test yourself.

As to why your code was twice as slow with the collapse clause, I can only guess: since the effect of collapsing a non-rectangular inner loop is unspecified, your OpenMP implementation may have run j over [0, n), i.e. the full matrix rather than half of it.


1 Comment

Nice. With your code it is a little bit faster: ~3.3s. Maybe it will scale better on more expensive tasks. Thank you very much!
