5

I'm trying to parallelize a part of a larger program using the C++ standard library and its execution policies. The original program uses std::accumulate to calculate sums over columns of 2d vectors (vectors of vectors) but since std::accumulate doesn't accept execution policies I'm trying to find a parallelizable alternative.

I tried switching to using std::reduce instead of std::accumulate. I'm quite new to C++, but from what I gathered from the C++ reference they should work quite similarly. However, my code does not compile after making this change. Why doesn't the modified code work and how can I fix it? Is there a better way to parallelize the sum over a column of a 2d vector (vector of vectors) using the C++ standard library? The implementation should work for both CPU and GPU parallelization.

Minimal reproducible example:

#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>

int main(int argc, char *argv[])
{
    std::vector<std::vector<double>> vec = {{1,2,3}, {4,5,6}, {7,8,9}};

    // working sequential version of sum over second column of vec
    double res = std::accumulate(vec.begin(), vec.end(), 0.0, [&](auto sum, auto b) { return sum + b[1]; });

    std::cout << res << std::endl; // prints 15 as expected

    // same but with reduce, does not compile
    res = std::reduce(vec.begin(), vec.end(), 0.0, [&](auto sum, auto b) { return sum + b[1]; });

    std::cout << res << std::endl;
}

Trying to compile this, I get the errors

g++-10 -std=c++20 program.cpp

program.cpp:16:90: error: subscripted value is neither array nor pointer
   16 |     res = std::reduce(vec.begin(), vec.end(), 0.0, [&](auto sum, auto b) { return sum + b[1]; });
      |                   

program.cpp:16:87: error: no match for ‘operator+’ (operand types are ‘std::vector<double>’ and ‘__gnu_cxx::__alloc_traits<std::allocator<double>, double>::value_type’ {aka ‘double’})
   16 |     res = std::reduce(vec.begin(), vec.end(), 0.0, [&](auto sum, auto b) { return sum + b[1]; });
      |                                                                                   ~~~~^~~~
11
  • 2
    Please edit your question and post a minimal reproducible example with #include and main(), and the verbatim error message in a code block. It will help users with similar problems to find this question and future answers. Commented Apr 27 at 23:33
  • 3
    For reduce, the predicate must accept any combination of double and std::vector<double> parameters, in any order - four possible combinations in all. This is what makes it parallellizable. This would be easiest to do using a named class with four overloads of operator(). A lambda would need a chain of if constexpr to achieve the same effect. However, what you really want is transform_reduce, like this: res = std::transform_reduce(vec.begin(), vec.end(), 0.0, std::plus{}, [](const auto& v) { return v[1]; }); Demo Commented Apr 28 at 0:01
  • Any thread-based parallelization comes with a quite large overhead. For things as simple as addition, you should first check avx2 parallelization, which alone could give you around 80% gains. Commented Apr 28 at 12:01
  • Side note: remember that accessing memory in row order is usually much faster than column order, since it's a lot friendlier to the CPUs prefetcher. So do think about whether you can lay out your data differently so it helps the CPU help you. Commented Apr 28 at 12:15
  • @IgorTandetnik Why not put that answer as... an answer instead of a comment ? That would alllow to properly "close" the thread :) Commented Apr 28 at 12:30

1 Answer 1

6

Re: how to compactly achieve your goal. You are looking for std::transform_reduce - it separates the action of transforming the input (in your case, selecting one element from each vector) from that of reducing the resulting transformed range. Like this:

res = std::transform_reduce(
    vec.begin(), vec.end(), 0.0,
    std::plus{},
    [](const auto& v) { return v[1]; });

Demo. Add execution policy to taste.


Re: why your attempt using std::reduce doesn't compile. When the type (say, V) of the initial value doesn't match that (say, E) of the range element, std::reduce expects a binary predicate that can take any combination of the two - four combinations in all. This is necessary to implement parallel execution.

E.g. in the simplest case, std::reduce could split the range in two, reduce each half, and combine the results. But then, it wouldn't have the initial value for the second half; so it needs to be able to call pred(E, E), to start working on that half. And in the end, it would have two scalar values in hand, and would need pred(V, V) to combine them.

This could be achieved by writing a named class with four overloads of operator(). Or, much more verbosely, by writing a lambda with a chain of if constexpr - something like

[](const auto& a, const auto& b) {
  if constexpr(std::is_same_v<double, std::decay_t<decltype(a)>> &&
               std::is_same_v<double, std::decay_t<decltype(b)>>) {
    return a + b;
  }
  // ...
}

Or, again, use std::transform_reduce as a better fit for the original problem.

Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.