
Consider the following code:

#include <iostream>
#include <chrono>
#include <vector>
#include <numeric>
#include <cmath>
#include <omp.h>

using namespace std;

typedef std::chrono::steady_clock myclock;

double measure_time(myclock::time_point begin, myclock::time_point end)
{
    return std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()/(double)1e6;
}

int main()
{
    int n = 20000;
    vector<double> v(n);
    iota(v.begin(), v.end(), 1.5);

    vector< vector<double> > D(n, vector<double>(n,0.0));

    myclock::time_point begin, end;

    begin = myclock::now();

    //#pragma omp parallel for collapse(2)
    //#pragma omp parallel for
    for(size_t i = 0; i < n - 1; i++){
        for(size_t j = i+1; j < n; j++){
            double d = sqrt(v[i]*v[i] + v[j]*v[j] + 1.5*v[i]*v[j]);
            D[i][j] = d;
            D[j][i] = d;
        }
    }

    end = myclock::now();
    double time = measure_time(begin, end);
    cout << "Time: " << time << " (s)" << endl;
    return 0;
}

For compiling:

g++ -std=c++11 -fopenmp -o main main.cpp

I obtained the following run time:

  • With #pragma omp parallel for collapse(2): 7.9425 (s)
  • With #pragma omp parallel for: 3.73262 (s)
  • Without OpenMP: 11.0935 (s)

System settings: Linux Mint 18.3 64-bit, g++ 5.4.0, quad-core processor.

I would expect the first to be faster than the second (which parallelizes only the outer loop), and both to be much faster than the third.

What did I do wrong, please? Both the first and the second ran on all 8 threads.

Thank you in advance for your help!

  • Please provide a minimal reproducible example as well as a detailed system description. It is impossible to provide a specific answer based on this code-snippet. Commented Dec 12, 2017 at 8:20
  • @Zulan Sorry for that. I have updated the question. Thanks. Commented Dec 12, 2017 at 13:28
  • Collapse does not work with loops with dependencies stackoverflow.com/questions/28482833/… Commented Dec 13, 2017 at 12:10

1 Answer


The collapse clause should not be used when the bounds of the inner loop depend on the outer loop's index. See Understanding the collapse clause in openmp.

In your case you are running over the upper triangle of a matrix (excluding the diagonal) and filling the lower triangle by symmetry. This cuts the number of iterations roughly in half. If you want to fuse/collapse the double loop you can do it by hand like this (see the end of this answer for more details).

#pragma omp parallel for
for(size_t k = 0; k < n*(n-1)/2; k++) {
    size_t i = k/n, j = k%n;
    if(j <= i) i = n - i - 2, j = n - j - 1;
    double d = sqrt(v[i]*v[i] + v[j]*v[j] + 1.5*v[i]*v[j]);
    D[i][j] = d;
    D[j][i] = d;
}

I think most people assume that collapsing a loop is going to give better performance but this is often not the case. In my experience most of the time there is no difference in performance but in some cases it's much worse due to cache issues. In a few cases it's better. You have to test yourself.

As to why your code was twice as slow with the collapse clause, I can only guess: since the effect of collapsing a non-rectangular inner loop is unspecified, your OpenMP implementation may have run j over [0, n), i.e. the full matrix rather than half of it.


1 Comment

Nice. With your code it is a little bit faster: ~3.3s. Maybe it will scale better on more expensive tasks. Thank you very much!
