
I am testing OpenMP for C++ because my software will rely heavily on speed gains from processor parallelisation.

I am getting strange results when running the following code.

  • The speed up from parallelisation is not as much as I would expect
  • Without -O flags the code runs far slower, and the parallel version is slower than the serial one.

I am using g++ 7.3.0 on Ubuntu 18.04, with an i5-8600 CPU and 16 GB of RAM.

Outputs:

Output 1 (not allowed to embed images yet since I'm a new member, so here is a transcript):

.../OpenMPTest$ g++ -O3 -o openmp main.cpp -fopenmp
.../OpenMPTest$ ./openmp
6 processors used.
Linear action took: 2.87415 seconds.
Parallel action took: 0.99954 seconds.

Output 2

.../OpenMPTest$ g++ -o openmp main.cpp -fopenmp
.../OpenMPTest$ ./openmp
6 processors used.
Linear action took: 25.7037 seconds.
Parallel action took: 68.0485 seconds.

As you can see, with 6 processors I'm only getting a ~2.9x speed-up, unless I omit the -O flag, in which case the program runs much slower overall and the parallel version is slower than the serial one, even though all 6 cores sit at 100% utilisation (checked with htop).

Why is this? Also, what can I do to achieve the full 6x increase in performance?

Source code:

#include <iostream>
#include <ctime>
#include <ratio>
#include <chrono>
#include <array>
#include <omp.h>

int main() {

    using namespace std::chrono;

    const int big_number = 1000000000;
    std::array<double, 6> array = { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };

    // Sequential

    high_resolution_clock::time_point start_linear = high_resolution_clock::now();

    for(int i = 0; i < 6; i++) {
        for(int j = 0; j < big_number; j++) {
            array[i]++;
        }   
    }

    high_resolution_clock::time_point end_linear = high_resolution_clock::now();

    // Parallel 

    high_resolution_clock::time_point start_parallel = high_resolution_clock::now();

    array = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0};

    #pragma omp parallel
    {
        #pragma omp for
        for(int i = 0; i < 6; i++) {
            for(int j = 0; j < big_number; j++) {
                array[i]++;
            }   
        }
    }

    high_resolution_clock::time_point end_parallel = high_resolution_clock::now();

    // Stats.

    std::cout << omp_get_num_procs() << " processors used." << std::endl << std::endl;

    duration<double> time_span = duration_cast<duration<double>>(end_linear - start_linear);
    std::cout << "Linear action took: " << time_span.count() << " seconds." << std::endl << std::endl;

    time_span = duration_cast<duration<double>>(end_parallel - start_parallel);
    std::cout << "Parallel action took: " << time_span.count() << " seconds." << std::endl << std::endl;

    return EXIT_SUCCESS;
}
Comments:

  • Please specify the processor and memory that you are using. Also, do not ever bother measuring or optimizing unoptimized code. Commented Sep 4, 2018 at 7:10
  • Thanks for the informative and helpful answers. How do I determine the number of threads available at run time? I used 6 as I am running on a 6-core CPU. (See the sketch below.) Commented Sep 4, 2018 at 7:31
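
To the second comment's question: OpenMP can report the thread count at run time, so the 6 need not be hard-coded. A minimal sketch using the standard omp_get_max_threads() and omp_get_num_threads() calls (compile with -fopenmp):

#include <iostream>
#include <cstdlib>
#include <omp.h>

int main() {

    // Outside a parallel region: the team size the next region would get.
    std::cout << "Max threads: " << omp_get_max_threads() << std::endl;

    #pragma omp parallel
    {
        // Inside a parallel region: the actual team size; print it once.
        #pragma omp single
        std::cout << "Threads in this team: " << omp_get_num_threads() << std::endl;
    }

    return EXIT_SUCCESS;
}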

1 Answer

It seems your code is affected by false sharing: all six counters sit next to each other in memory, so the threads keep writing to the same cache line.

Don't let different threads write to the same cache line; better still, try not to share variables between threads at all. The version below pads each counter out to its own 64-byte cache line (a stride of 8 doubles at 8 bytes each):

#include <iostream>
#include <ctime>
#include <ratio>
#include <chrono>
#include <array>
#include <omp.h>

int main() {

  using namespace std::chrono;

  const int big_number = 1000000000;
  alignas(64) std::array<double, 6*8> array = { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 }; // one 64-byte cache line (8 doubles) per counter

  // Sequential

  high_resolution_clock::time_point start_linear = high_resolution_clock::now();

  for(int i = 0; i < 6; i++) {
    for(int j = 0; j < big_number; j++) {
      array[i]++;
    }
  }

  high_resolution_clock::time_point end_linear = high_resolution_clock::now();

  // Parallel

  high_resolution_clock::time_point start_parallel = high_resolution_clock::now();

  array = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0};

  #pragma omp parallel
  {
    #pragma omp for
    for(int i = 0; i < 6; i++) {
      for(int j = 0; j < big_number; j++) {
        array[i*8]++; // stride of 8 doubles: each thread stays on its own cache line
      }
    }
  }

  high_resolution_clock::time_point end_parallel = high_resolution_clock::now();

  // Stats.

  std::cout << omp_get_num_procs() << " processors used." << std::endl << std::endl;

  duration<double> time_span = duration_cast<duration<double>>(end_linear - start_linear);
  std::cout << "Linear action took: " << time_span.count() << " seconds." << std::endl << std::endl;

  time_span = duration_cast<duration<double>>(end_parallel - start_parallel);
  std::cout << "Parallel action took: " << time_span.count() << " seconds." << std::endl << std::endl;

  return EXIT_SUCCESS;
}

8 processors used.

Linear action took: 26.9021 seconds.

Parallel action took: 6.41319 seconds.
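
As a sketch of the "do not share variables between threads" approach mentioned above, each thread can bump a private local counter and store it into the shared array once at the end. This is a hypothetical variant of the loop, not the code above; it works with the original unpadded std::array<double, 6>, no alignas needed:

  #pragma omp parallel for
  for(int i = 0; i < 6; i++) {
    double local = 0.0;                 // thread-private accumulator
    for(int j = 0; j < big_number; j++) {
      local++;                          // no shared memory touched here
    }
    array[i] = local;                   // single shared write per counter
  }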

For further background, you can read up on false sharing.


Comments:

  • Hi, I edited the code and added alignas(64) before the definition of array; you can try it again. A better way is to try not to share variables between threads. I cannot figure out why it's not 6x either.
  • While the code certainly looks like false sharing, I highly doubt that is actually the case for the optimized build. In the disassembly you can see that gcc keeps array[i] in a register. Also, if false sharing hit every iteration, the parallel version would not be faster than the serial code at all. It may be the case for the unoptimized build, but that is irrelevant.
