
Since C++17, the standard library has offered parallel algorithms, so I tried the following code, which sums a list of numbers, to see whether there are any performance gains.

#include <algorithm>
#include <chrono>
#include <execution>
#include <numeric>
#include <iostream>
#include <vector>

int main() {
  size_t n = 100000000;
  std::vector<size_t> vec(n);
  std::iota(vec.begin(), vec.end(), 0);

  auto par_sum = [&](size_t k) {
    auto t1 = std::chrono::high_resolution_clock::now();

    std::vector<size_t> rez(k);
    std::iota(rez.begin(), rez.end(), 0);
    size_t batch = static_cast<size_t>(n / k) + 1;
    std::for_each(std::execution::par_unseq, rez.begin(), rez.end(),
      [&](size_t id) {
        size_t cum = 0;
        for (size_t i = id*batch; i < std::min((id+1)*batch, n); ++i) {
          cum += vec[i];
        }
        rez[id] = cum;
    });

    auto t2 = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
    std::cout << "n_worker = " << k
      << ", time = " << duration
      << ", rez = " << std::accumulate(rez.begin(), rez.end(), 0lu)
      << std::endl;
  };

  par_sum(1);
  par_sum(3);
  par_sum(5);
}

Compiled with

g++ -std=c++17 -O3 -mavx a.cpp -L/usr/local/lib -ltbb

The results show:

n_worker = 1, time = 51875, rez = 4999999950000000
n_worker = 3, time = 57616, rez = 4999999950000000
n_worker = 5, time = 63193, rez = 4999999950000000

Question:

  • Why are there no performance gains over a single worker?
  • 100000000 elements of size_t is almost a gigabyte of data. You're almost certainly memory bound. Commented Oct 12, 2020 at 19:56
  • @MooingDuck my mac has 32GB, so I think it should be fine. Commented Oct 12, 2020 at 19:58
  • If you had less than a GB of memory, you'd be disk bound. You're memory bound because your CPU doesn't have 1 GB of L1 cache: L1 is usually 256 KB to 1 MB, L2 is 256 KB to 8 MB, and L3 is usually 4 MB to upwards of 50 MB. Since you have more data than that, the CPU spends all its time waiting for data to come back from the RAM sticks. Commented Oct 12, 2020 at 19:59
  • Off topic: IMO, lambdas like this par_sum are harmful. Couldn't you define a class with a method to make this easier to read and maintain? Commented Oct 12, 2020 at 19:59
  • 1
    My result: n_worker = 1, time = 67489, rez = 887459712 n_worker = 3, time = 25965, rez = 887459712 n_worker = 5, time = 16598, rez = 887459712 Commented Oct 12, 2020 at 20:57

1 Answer


I would posit that for a small amount of work per element, a tight loop can execute on a single CPU while staying entirely within its L1 cache. As soon as you increase the amount of parallelism on the same data, you begin to incur overhead from cache-coherency traffic and page faults.

It would probably be worth looking at ways of counting cache misses, such as those suggested here: Programmatically counting cache faults

and also see: What is a "cache-friendly" code?

plus other resources on "cache friendly code" and "data oriented design": https://www.youtube.com/watch?v=b5v9aElYU2I

and https://youtu.be/fHNmRkzxHWs

This is related to false sharing which is explained here: https://www.youtube.com/watch?v=dznxqe1Uk3E


Comments

Thanks for sharing the links above! The code in my post was meant to verify whether a parallel for_each could boost performance; if it doesn't work that way, I'll investigate further and see which tricks to use. But what you've shared here doesn't look like a simple solution.
Or if you're doing more work on each item of data, yes. stackoverflow.com/questions/4087280/… implies that loading data from RAM takes approximately as long as 60 to 100 addition operations.
Tried std::execution::par just now, same time across 1/3/5 workers. @MooingDuck, even with much less data (1m numbers now).
Sure. With any number of elements, one thread will always be as fast or faster, as long as you're only doing a single addition per item. If you were doing something CPU-bound per item, such as encryption or compression, then you'd see gains from more threads.
I'm not sure about that. Every read drags part of vec into a cache line. I think you should keep all the values used by one thread close together (preferably within std::hardware_constructive_interference_size) and separate the different threads' datasets by at least std::hardware_destructive_interference_size.
