How to parallelize a sum over an arbitrary column of a vector of vectors using C++ standard library execution policies? [closed]

Question

Closed. This question needs debugging details. It is not currently accepting answers.

Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.

Closed 7 months ago.

Improve this question

I'm trying to parallelize a part of a larger program using the C++ standard library and its execution policies. The original program uses std::accumulate to calculate sums over columns of 2d vectors (vectors of vectors) but since std::accumulate doesn't accept execution policies I'm trying to find a parallelizable alternative.

I tried switching to using std::reduce instead of std::accumulate. I'm quite new to C++, but from what I gathered from the C++ reference they should work quite similarly. However, my code does not compile after making this change. Why doesn't the modified code work and how can I fix it? Is there a better way to parallelize the sum over a column of a 2d vector (vector of vectors) using the C++ standard library? The implementation should work for both CPU and GPU parallelization.

Minimal reproducible example:

#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>

int main(int argc, char *argv[])
{
    std::vector<std::vector<double>> vec = {{1,2,3}, {4,5,6}, {7,8,9}};

    // working sequential version of sum over second column of vec
    double res = std::accumulate(vec.begin(), vec.end(), 0.0, [&](auto sum, auto b) { return sum + b[1]; });

    std::cout << res << std::endl; // prints 15 as expected

    // same but with reduce, does not compile
    res = std::reduce(vec.begin(), vec.end(), 0.0, [&](auto sum, auto b) { return sum + b[1]; });

    std::cout << res << std::endl;
}

Trying to compile this, I get the errors

g++-10 -std=c++20 program.cpp

program.cpp:16:90: error: subscripted value is neither array nor pointer
   16 |     res = std::reduce(vec.begin(), vec.end(), 0.0, [&](auto sum, auto b) { return sum + b[1]; });
      |                   

program.cpp:16:87: error: no match for ‘operator+’ (operand types are ‘std::vector<double>’ and ‘__gnu_cxx::__alloc_traits<std::allocator<double>, double>::value_type’ {aka ‘double’})
   16 |     res = std::reduce(vec.begin(), vec.end(), 0.0, [&](auto sum, auto b) { return sum + b[1]; });
      |                                                                                   ~~~~^~~~

Please edit your question and post a minimal reproducible example with #include and main(), and the verbatim error message in a code block. It will help users with similar problems to find this question and future answers. — 3CEZVQ
– 3CEZVQ, Commented Apr 27 at 23:33
For reduce, the predicate must accept any combination of double and std::vector<double> parameters, in any order - four possible combinations in all. This is what makes it parallellizable. This would be easiest to do using a named class with four overloads of operator(). A lambda would need a chain of if constexpr to achieve the same effect. However, what you really want is transform_reduce, like this: res = std::transform_reduce(vec.begin(), vec.end(), 0.0, std::plus{}, [](const auto& v) { return v[1]; }); Demo — Igor Tandetnik
– Igor Tandetnik, Commented Apr 28 at 0:01
Any thread-based parallelization comes with a quite large overhead. For things as simple as addition, you should first check avx2 parallelization, which alone could give you around 80% gains. — Michaël Roy
– Michaël Roy, Commented Apr 28 at 12:01
Side note: remember that accessing memory in row order is usually much faster than column order, since it's a lot friendlier to the CPUs prefetcher. So do think about whether you can lay out your data differently so it helps the CPU help you. — Jesper Juhl
– Jesper Juhl, Commented Apr 28 at 12:15
@IgorTandetnik Why not put that answer as... an answer instead of a comment ? That would alllow to properly "close" the thread :) — Fareanor
– Fareanor, Commented Apr 28 at 12:30

Igor Tandetnik · Accepted Answer · 2025-04-28 12:56:03Z

Re: how to compactly achieve your goal. You are looking for std::transform_reduce - it separates the action of transforming the input (in your case, selecting one element from each vector) from that of reducing the resulting transformed range. Like this:

res = std::transform_reduce(
    vec.begin(), vec.end(), 0.0,
    std::plus{},
    [](const auto& v) { return v[1]; });

Demo. Add execution policy to taste.

Re: why your attempt using std::reduce doesn't compile. When the type (say, V) of the initial value doesn't match that (say, E) of the range element, std::reduce expects a binary predicate that can take any combination of the two - four combinations in all. This is necessary to implement parallel execution.

E.g. in the simplest case, std::reduce could split the range in two, reduce each half, and combine the results. But then, it wouldn't have the initial value for the second half; so it needs to be able to call pred(E, E), to start working on that half. And in the end, it would have two scalar values in hand, and would need pred(V, V) to combine them.

This could be achieved by writing a named class with four overloads of operator(). Or, much more verbosely, by writing a lambda with a chain of if constexpr - something like

[](const auto& a, const auto& b) {
  if constexpr(std::is_same_v<double, std::decay_t<decltype(a)>> &&
               std::is_same_v<double, std::decay_t<decltype(b)>>) {
    return a + b;
  }
  // ...
}

Or, again, use std::transform_reduce as a better fit for the original problem.

Collectives™ on Stack Overflow

How to parallelize a sum over an arbitrary column of a vector of vectors using C++ standard library execution policies? [closed]

1 Answer 1

Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Related