
Since C++17 the standard library supports parallel algorithms, so I thought they would be the go-to option for us. But after comparing with TBB and OpenMP I changed my mind: I found the standard library to be much slower.

With this post I want to ask for professional advice on whether I should abandon the standard library's parallel algorithms and use TBB or OpenMP instead. Thanks!

Env:

  • macOS Catalina 10.15.7
  • GNU g++-10

Benchmark code:

#include <algorithm>
#include <chrono>
#include <cmath>
#include <execution>
#include <iostream>
#include <numeric>  // std::iota
#include <string>
#include <tbb/parallel_for.h>
#include <vector>

const size_t N = 1000000;

double std_for() {
  auto values = std::vector<double>(N);

  size_t n_par = 5lu;
  auto indices = std::vector<size_t>(n_par);
  std::iota(indices.begin(), indices.end(), 0lu);
  size_t stride = static_cast<size_t>(N / n_par) + 1;

  std::for_each(
      std::execution::par,
      indices.begin(),
      indices.end(),
      [&](size_t index) {
        size_t begin = index * stride;
        // Clamp the last chunk: (index + 1) * stride can exceed N,
        // which would write past the end of the vector.
        size_t end = std::min((index + 1) * stride, N);
        for (size_t i = begin; i < end; ++i) {
          values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
        }
      });

  double total = 0;

  for (double value : values)
  {
    total += value;
  }
  return total;
}

double tbb_for() {
  auto values = std::vector<double>(N);

  tbb::parallel_for(
      tbb::blocked_range<int>(0, values.size()),
      [&](tbb::blocked_range<int> r) {
        for (int i=r.begin(); i<r.end(); ++i) {
          values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
        }
      });

  double total = 0;
  for (double value : values) {
    total += value;
  }
  return total;
}

double omp_for()
{
  auto values = std::vector<double>(N);

#pragma omp parallel for
  for (int i=0; i<values.size(); ++i) {
    values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
  }

  double total = 0;

  for (double value : values) {
    total += value;
  }
  return total;
}

double seq_for()
{
  auto values = std::vector<double>(N);

  for (int i=0; i<values.size(); ++i) {
    values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
  }

  double total = 0;

  for (double value : values) {
    total += value;
  }
  return total;
}

void time_it(double(*fn_ptr)(), const std::string& fn_name) {
  auto t1 = std::chrono::high_resolution_clock::now();
  auto rez = fn_ptr();
  auto t2 = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
  std::cout << fn_name << ", rez = " << rez << ", dur = " << duration << std::endl;
}

int main(int argc, char** argv) {
  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " std_for|tbb_for|omp_for|seq_for" << std::endl;
    return 1;
  }
  std::string op(argv[1]);
  if (op == "std_for") {
    time_it(&std_for, op);
  } else if (op == "omp_for") {
    time_it(&omp_for, op);
  } else if (op == "tbb_for") {
    time_it(&tbb_for, op);
  } else if (op == "seq_for") {
    time_it(&seq_for, op);
  }
}

Compile options:

g++ --std=c++17 -O3 b.cpp -ltbb -I /usr/local/include -L /usr/local/lib -fopenmp

Results:

std_for, rez = 500106, dur = 11119
tbb_for, rez = 500106, dur = 7372
omp_for, rez = 500106, dur = 4781
seq_for, rez = 500106, dur = 27910

We can see that std_for is faster than seq_for (the sequential for-loop), but it is still much slower than TBB and OpenMP.

UPDATE

As people suggested in the comments, I now run each version separately to be fair. The code above has been updated, and the results are as follows:

>>> ./a.out seq_for
seq_for, rez = 500106, dur = 29885

>>> ./a.out tbb_for
tbb_for, rez = 500106, dur = 10619

>>> ./a.out omp_for
omp_for, rez = 500106, dur = 10052

>>> ./a.out std_for
std_for, rez = 500106, dur = 12423

And as people said, running the four versions in a row was not fair, compared with the previous results.

  • Do you get similar results if you call the various methods in a different order? It is possible that the various vectors reuse memory that was freed by the previous function, resulting in fewer cache misses for the later functions. Commented Oct 12, 2020 at 23:11
  • OpenMP uses SIMD optimizations for std::exp and std::sin. You can try to change your benchmark, i.e. build all tests as separate executables and use max optimizations like -mtune=native -O3. Commented Oct 12, 2020 at 23:35
  • @VictorGubin No, there is no SIMD optimization on GCC. Firstly, simd is not specified (although GCC generally does not care about it). Moreover, -ffast-math is sadly required so far on GCC for the vectorization to be applied (because supporting strict IEEE-754 compliance is hard). Actually, vectorization is done independently of OpenMP here on GCC. You can check vectorization here. Commented Oct 13, 2020 at 7:51
  • Running all of the methods in the same execution of the code may lead to over-subscription (since they will likely each create their own pool of threads). Also, thread creation is expensive, so you should either run your parallel region twice and time the second, or run an empty parallel operation (to start the threads) and then time your real one. Commented Oct 13, 2020 at 8:29
  • The first OpenMP parallel region is slow(er) because it brings up the thread team. Always measure the performance of OpenMP programs after one "warm-up" parallel region. I asked you to switch the calls as that will move the startup overhead from the std_for() part to the omp_for() part. Commented Oct 13, 2020 at 22:21

3 Answers


You already found that it matters what exactly is measured and how. Your final task will certainly be quite different from this simple exercise and will not entirely reflect the results found here.

Besides the caching and warm-up effects that depend on the order in which the tasks run (you studied this explicitly in your updated question), there is another issue in your example you should consider.

What matters is the actual parallel code. If it does not dominate your performance/runtime, then parallelization is not the right solution. But in your example you also measure resource allocation, initialization and the final reduction. If those drive the real costs in your final application then, again, parallelization is not the silver bullet. Thus, for a fair comparison, you should measure only the execution of the actual parallel code. I suggest modifying your code along these lines (sorry, I don't have OpenMP installed) and continuing your studies:

#include <algorithm>
#include <chrono>
#include <cmath>
#include <execution>
#include <iostream>
#include <numeric>  // std::iota
#include <string>
#include <tbb/parallel_for.h>
#include <vector>

const size_t N = 10000000; // #1

void std_for(std::vector<double>& values, 
             std::vector<size_t> const& indices, 
             size_t const stride) {

  std::for_each(
      std::execution::par,
      indices.begin(),
      indices.end(),
      [&](size_t index) {
        size_t begin = index * stride;
        // Clamp the last chunk: (index + 1) * stride can exceed values.size().
        size_t end = std::min((index + 1) * stride, values.size());
        for (size_t i = begin; i < end; ++i) {
          values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
        }
      });
}

void tbb_for(std::vector<double>& values) {

  tbb::parallel_for(
      tbb::blocked_range<int>(0, values.size()),
      [&](tbb::blocked_range<int> r) {
        for (int i=r.begin(); i<r.end(); ++i) {
          values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
        }
      });

}

/*
double omp_for()
{
  auto values = std::vector<double>(N);

#pragma omp parallel for
  for (int i=0; i<values.size(); ++i) {
    values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
  }

  double total = 0;

  for (double value : values) {
    total += value;
  }
  return total;
}
*/

void seq_for(std::vector<double>& values)
{
  for (int i=0; i<values.size(); ++i) {
    values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
  }
}

void time_it(void(*fn_ptr)(std::vector<double>&), const std::string& fn_name) {
  std::vector<double> values = std::vector<double>(N);

  auto t1 = std::chrono::high_resolution_clock::now();
  fn_ptr(values);
  auto t2 = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();

  double total = 0;
  for (double value : values) {
    total += value;
  }
  std::cout << fn_name << ", res = " << total << ", dur = " << duration << std::endl;
}

void time_it_std(void(*fn_ptr)(std::vector<double>&, std::vector<size_t> const&, size_t const), const std::string& fn_name) {
  std::vector<double> values = std::vector<double>(N);

  size_t n_par = 5lu;  // #2
  auto indices = std::vector<size_t>(n_par);
  std::iota(indices.begin(), indices.end(), 0lu);
  size_t stride = static_cast<size_t>(N / n_par) + 1;
  
  auto t1 = std::chrono::high_resolution_clock::now();
  fn_ptr(values, indices, stride);
  auto t2 = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();

  double total = 0;
  for (double value : values) {
    total += value;
  }
  std::cout << fn_name << ", res = " << total << ", dur = " << duration << std::endl;
}



int main(int argc, char** argv) {
  std::string op(argv[1]);
  if (op == "std_for") {
    time_it_std(&std_for, op);
    //  } else if (op == "omp_for") {
    //time_it(&omp_for, op);
  } else if (op == "tbb_for") {
    time_it(&tbb_for, op);
  } else if (op == "seq_for") {
    time_it(&seq_for, op);
  }
}

On my (slow) system this results in:

  • std_for, res = 5.00046e+06, dur = 66393
  • tbb_for, res = 5.00046e+06, dur = 51746
  • seq_for, res = 5.00046e+06, dur = 196156

I note here that the difference between seq_for and tbb_for has further increased: it is now ~4x, while in your example it looks more like ~3x. And std_for is still about 20-30% slower than tbb_for.

However, there are further parameters. After increasing N (see #1) by a factor of 10 (ok, this is not very important) and n_par (see #2) from 5 to 100 (this is important), the results are

  • tbb_for, res = 5.00005e+07, dur = 486179
  • std_for, res = 5.00005e+07, dur = 479306

Here std_for is on par with tbb_for!

Thus, to answer your question: I clearly would NOT discard C++17 std parallelization right away.




Perhaps you already know this, but something I don't see mentioned here is the fact that (at least for GCC and Clang) the PSTL is actually implemented on top of a backend: TBB, OpenMP (currently on Clang only, I believe), or a sequential fallback.

I'm guessing you're using libc++ since you are on Mac; as far as I know, for Linux at least, the LLVM distributions do not come with the PSTL enabled, and if you build PSTL and libcxx/libcxxabi from source, it defaults to the sequential backend.

https://github.com/llvm/llvm-project/blob/main/pstl/CMakeLists.txt

https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/pstl/pstl_config.h


  1. OpenMP is good for straightforward parallel coding.
  2. TBB, on the other hand, uses a work-stealing mechanism, which can give you better performance for loops that are imbalanced or nested.
  3. I prefer TBB over OpenMP for complex and nested parallelism. (OpenMP has a huge overhead for nested parallelism.)

