
Since C++17, the standard library has offered parallel algorithms, so I tried the following code, which sums a list of numbers, to see whether there are any performance gains.

#include <algorithm>
#include <chrono>
#include <execution>
#include <numeric>
#include <iostream>
#include <vector>

int main() {
  size_t n = 100000000;
  std::vector<size_t> vec(n);
  std::iota(vec.begin(), vec.end(), 0);

  auto par_sum = [&](size_t k) {
    auto t1 = std::chrono::high_resolution_clock::now();

    std::vector<size_t> rez(k);
    std::iota(rez.begin(), rez.end(), 0);
    size_t batch = static_cast<size_t>(n / k) + 1;
    std::for_each(std::execution::par_unseq, rez.begin(), rez.end(),
      [&](size_t id) {
        size_t cum = 0;
        for (size_t i = id*batch; i < std::min((id+1)*batch, n); ++i) {
          cum += vec[i];
        }
        rez[id] = cum;
    });

    auto t2 = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
    std::cout << "n_worker = " << k
      << ", time = " << duration
      << ", rez = " << std::accumulate(rez.begin(), rez.end(), 0lu)
      << std::endl;
  };

  par_sum(1);
  par_sum(3);
  par_sum(5);
}

Compiled with

g++ -std=c++17 -O3 -mavx a.cpp -L/usr/local/lib -ltbb

The results show:

n_worker = 1, time = 51875, rez = 4999999950000000
n_worker = 3, time = 57616, rez = 4999999950000000
n_worker = 5, time = 63193, rez = 4999999950000000

Question:

  • Why are there no performance gains over a single worker?
  • 100000000 elements of size_t is almost a gigabyte of data. You're almost certainly memory bound. Commented Oct 12, 2020 at 19:56
  • @MooingDuck my mac has 32GB, so I think it should be fine. Commented Oct 12, 2020 at 19:58
  • If you had less than a GB of memory, you'd be disk bound. You're memory bound because your CPU doesn't have 1 GB of L1 cache: L1 is usually 256 KB to 1 MB, L2 is 256 KB to 8 MB, and L3 is usually 4 MB to upwards of 50 MB. Since you have more data than that, the CPU spends all its time waiting for data to come back from the RAM sticks. Commented Oct 12, 2020 at 19:59
  • Off topic: IMO, lambdas like this par_sum are harmful. Couldn't you define a class with a method to make this easier to read and maintain? Commented Oct 12, 2020 at 19:59
  • 1
    My result: n_worker = 1, time = 67489, rez = 887459712 n_worker = 3, time = 25965, rez = 887459712 n_worker = 5, time = 16598, rez = 887459712 Commented Oct 12, 2020 at 20:57

1 Answer


I would posit that for a small amount of work per element, a tight loop can execute on a single CPU while staying entirely within its L1 cache. As soon as you increase the amount of parallelism on the same data, you begin to incur overhead from cache-coherency traffic and page faults.

It would probably be worth looking at ways of counting cache misses, such as those suggested here: Programmatically counting cache faults

and also see: What is a "cache-friendly" code?

plus other resources on "cache friendly code" and "data oriented design": https://www.youtube.com/watch?v=b5v9aElYU2I

and https://youtu.be/fHNmRkzxHWs

This is related to false sharing which is explained here: https://www.youtube.com/watch?v=dznxqe1Uk3E


Comments

Thanks for sharing the links above! The code in my post was meant to verify whether a parallel for_each could boost performance; if it doesn't work that way, I'll investigate further and see which tricks to use. But what you've shared here doesn't look like a simple solution.
Or if you're doing more work on each item of data, yes. stackoverflow.com/questions/4087280/… implies that loading data from RAM takes approximately as long as 60 to 100 addition operations.
Tried std::execution::par just now, same time across 1/3/5 workers. @MooingDuck, even with much less data (1m numbers now).
Sure. With any number of elements, one thread will always be as fast or faster, as long as you're only doing a single addition per item. If you were doing something CPU-bound per item, such as encryption or compression, then you'd see gains from more threads.
I'm not sure about that. Every read drags part of vec into a cache line. I think you should keep all the values used by one thread close together (preferably within std::hardware_constructive_interference_size) and separate the different threads' datasets by at least std::hardware_destructive_interference_size.
