
I am testing OpenMP for C++ because my software will rely heavily on speed gains from processor parallelisation.

I am getting strange results when running the following code.

  • The speed up from parallelisation is not as much as I would expect
  • Without -O flags the code runs far slower, and the parallel version is slower than the serial one.

I am using g++ 7.3.0 on Ubuntu 18.04, with an i5-8600 CPU and 16 GB of RAM.

Outputs:

Output 1 (not allowed to embed images yet since I'm a new member, so here is a transcript):

.../OpenMPTest$ g++ -O3 -o openmp main.cpp -fopenmp
.../OpenMPTest$ ./openmp
6 processors used.
Linear action took: 2.87415 seconds.
Parallel action took: 0.99954 seconds.

Output 2

.../OpenMPTest$ g++ -o openmp main.cpp -fopenmp
.../OpenMPTest$ ./openmp
6 processors used.
Linear action took: 25.7037 seconds.
Parallel action took: 68.0485 seconds.

As you can see, with 6 processors I'm only getting a ~2.9x speed-up, unless I omit the -O flag, in which case the program runs much slower overall and the parallel version is slower than the serial one, even though all 6 cores sit at 100% utilisation (checked with htop).

Why is this? Also, what can I do to achieve the full 6x increase in performance?

Source code:

#include <iostream>
#include <ctime>
#include <ratio>
#include <chrono>
#include <array>
#include <omp.h>

int main() {

    using namespace std::chrono;

    const int big_number = 1000000000;
    std::array<double, 6> array = { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };

    // Sequential

    high_resolution_clock::time_point start_linear = high_resolution_clock::now();

    for(int i = 0; i < 6; i++) {
        for(int j = 0; j < big_number; j++) {
            array[i]++;
        }   
    }

    high_resolution_clock::time_point end_linear = high_resolution_clock::now();

    // Parallel 

    high_resolution_clock::time_point start_parallel = high_resolution_clock::now();

    array = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0};

    #pragma omp parallel
    {
        #pragma omp for
        for(int i = 0; i < 6; i++) {
            for(int j = 0; j < big_number; j++) {
                array[i]++;
            }   
        }
    }

    high_resolution_clock::time_point end_parallel = high_resolution_clock::now();

    // Stats.

    std::cout << omp_get_num_procs() << " processors used." << std::endl << std::endl;

    duration<double> time_span = duration_cast<duration<double>>(end_linear - start_linear);
    std::cout << "Linear action took: " << time_span.count() << " seconds." << std::endl << std::endl;

    time_span = duration_cast<duration<double>>(end_parallel - start_parallel);
    std::cout << "Parallel action took: " << time_span.count() << " seconds." << std::endl << std::endl;

    return EXIT_SUCCESS;
}
Comments:

  • Please specify the processor and memory that you are using. Also, do not ever bother measuring or optimizing unoptimized code. Commented Sep 4, 2018 at 7:10
  • Thanks for the informative and helpful answers. How do I determine the number of threads available at run time? I used 6 as I am running on a 6-core CPU. (See the sketch below.) Commented Sep 4, 2018 at 7:31
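
To the second comment's question: OpenMP can report the thread count at run time, so the 6 need not be hard-coded. A minimal sketch using the standard omp_get_max_threads() and omp_get_num_threads() calls (compile with -fopenmp):

#include <iostream>
#include <cstdlib>
#include <omp.h>

int main() {

    // Outside a parallel region: the team size the next region would get.
    std::cout << "Max threads: " << omp_get_max_threads() << std::endl;

    #pragma omp parallel
    {
        // Inside a parallel region: the actual team size; print it once.
        #pragma omp single
        std::cout << "Threads in this team: " << omp_get_num_threads() << std::endl;
    }

    return EXIT_SUCCESS;
}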

1 Answer

It seems your code is affected by false sharing: all six counters sit next to each other in memory, so the threads keep writing to the same cache line.

Don't let different threads write to the same cache line; better still, try not to share variables between threads at all. The version below pads each counter out to its own 64-byte cache line (a stride of 8 doubles at 8 bytes each):

#include <iostream>
#include <ctime>
#include <ratio>
#include <chrono>
#include <array>
#include <omp.h>

int main() {

  using namespace std::chrono;

  const int big_number = 1000000000;
  alignas(64) std::array<double, 6*8> array = { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 }; // one 64-byte cache line (8 doubles) per counter

  // Sequential

  high_resolution_clock::time_point start_linear = high_resolution_clock::now();

  for(int i = 0; i < 6; i++) {
    for(int j = 0; j < big_number; j++) {
      array[i]++;
    }
  }

  high_resolution_clock::time_point end_linear = high_resolution_clock::now();

  // Parallel

  high_resolution_clock::time_point start_parallel = high_resolution_clock::now();

  array = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0};

  #pragma omp parallel
  {
    #pragma omp for
    for(int i = 0; i < 6; i++) {
      for(int j = 0; j < big_number; j++) {
        array[i*8]++; // stride of 8 doubles: each thread stays on its own cache line
      }
    }
  }

  high_resolution_clock::time_point end_parallel = high_resolution_clock::now();

  // Stats.

  std::cout << omp_get_num_procs() << " processors used." << std::endl << std::endl;

  duration<double> time_span = duration_cast<duration<double>>(end_linear - start_linear);
  std::cout << "Linear action took: " << time_span.count() << " seconds." << std::endl << std::endl;

  time_span = duration_cast<duration<double>>(end_parallel - start_parallel);
  std::cout << "Parallel action took: " << time_span.count() << " seconds." << std::endl << std::endl;

  return EXIT_SUCCESS;
}

8 processors used.

Linear action took: 26.9021 seconds.

Parallel action took: 6.41319 seconds.
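
As a sketch of the "do not share variables between threads" approach mentioned above, each thread can bump a private local counter and store it into the shared array once at the end. This is a hypothetical variant of the loop, not the code above; it works with the original unpadded std::array<double, 6>, no alignas needed:

  #pragma omp parallel for
  for(int i = 0; i < 6; i++) {
    double local = 0.0;                 // thread-private accumulator
    for(int j = 0; j < big_number; j++) {
      local++;                          // no shared memory touched here
    }
    array[i] = local;                   // single shared write per counter
  }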

For further background, you can read up on false sharing.


Comments:

  • Hi, I edited the code and added alignas(64) before the definition of array; you can try it again. A better way is to try not to share variables between threads. I cannot figure out why it's not 6x either.
  • While the code certainly looks like false sharing, I highly doubt that is actually the case for the optimized build. In the disassembly you can see that gcc keeps array[i] in a register. Also, if false sharing hit every iteration, the parallel version would not be faster than the serial code at all. It may be the case for the unoptimized build, but that is irrelevant.
