I'm using OpenMP to parallelize a loop that internally uses AVX-512 via Agner Fog's VCL (Vector Class Library).

Here is the code:

double HarmonicSeries(const unsigned long long int N) {
  unsigned long long int i;
  Vec8d divV(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0);
  Vec8d sumV(0.0);
  const Vec8d addV(8.0);
  const Vec8d oneV(1.0);

  #pragma omp parallel for reduction(+:sumV,divV)
  for(i=0; i<N; ++i) {
    sumV += oneV / divV;
    divV += addV;
  }
  return horizontal_add(sumV);
}

When I try to compile the code above, I get:

g++ -Wall -Wextra -O3 -g -I include -fopenmp -m64 -mavx2 -mfma -std=c++17 -o harmonic_series harmonic_series.cpp
harmonic_series.cpp:87:40: error: user defined reduction not found for ‘sumV’
   87 |   #pragma omp parallel for reduction(+:sumV,divV)
      |                                        ^~~~
harmonic_series.cpp:87:45: error: user defined reduction not found for ‘divV’
   87 |   #pragma omp parallel for reduction(+:sumV,divV)

Any hints on how to solve this and provide the user-defined reduction for the Vec8d class? It's simply the plus operator, which is already defined by the VCL class, but I cannot find any example of how to code this.

Thanks a lot for any help!

  • This answer: stackoverflow.com/questions/70678525/… helped me move forward. By adding #pragma omp declare reduction( + : Vec8d : omp_out = omp_out + omp_in ) initializer (omp_priv=omp_orig) I can make the compiler happy - however, something is not right: I'm getting completely wrong sumV results. Commented Nov 22, 2022 at 3:05

2 Answers

Here is the final solution. It uses automatic reduction and avoids the u64->fp conversion by computing divV = startdivV + i * addV only in the first iteration of each thread's chunk and then using divV += addV for all subsequent iterations. Runtime to sum the first 9.6e10 terms is {real 9s, user 1m46s} with 12 threads on an Intel Core i7-10850H CPU - the same as for the manual reduction.

double HarmonicSeries(const unsigned long long int N) {
  unsigned long long int i;
  Vec8d divV(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0);
  Vec8d sumV(0.0);
  const Vec8d addV(8.0);
  const Vec8d oneV(1.0);
  const Vec8d startdivV = divV;
  bool first_loop = true;
  #pragma omp declare reduction( + : Vec8d : omp_out = omp_out + omp_in ) initializer (omp_priv=omp_orig)
  // It's important to mark the "first_loop" variable as firstprivate so that each thread's private copy gets initialized.
  #pragma omp parallel for firstprivate(first_loop) lastprivate(divV) reduction(+:sumV)
  for(i=0; i<N; ++i) {
    if (first_loop) {
      divV = startdivV + i * addV;
      first_loop = false;
    } else {
      divV += addV;
    }
    sumV += oneV / divV;
  }
  return horizontal_add(sumV);
}

1 Comment

Complete code is available here: github.com/jirka-h/harmonic_series

I have found two solutions:

  1. Using #pragma omp declare reduction
double HarmonicSeries(const unsigned long long int N) {
  unsigned long long int i;
  Vec8d divV(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0);
  Vec8d sumV(0.0);
  const Vec8d addV(8.0);
  const Vec8d oneV(1.0);
  const Vec8d startdivV = divV;

  #pragma omp declare reduction( + : Vec8d : omp_out = omp_out + omp_in ) initializer (omp_priv=omp_orig)
  #pragma omp parallel for private(divV) reduction(+:sumV)
  for(i=0; i<N; ++i) {
    divV = startdivV + i * addV;
    sumV += oneV / divV;
  }
  return horizontal_add(sumV);
}

Drawback: there is some performance penalty, since the divisor has to be computed each time as

divV = startdivV + i * addV;

instead of being incremented at each step by addV:

divV += addV;

(mainly due to the u64->fp conversion rather than the multiply itself; see the comments below).

Question: can I use omp parallel for with private(divV), but set the divisor at the start for each thread and then only increment it?

divV = startdivV + start_value_of_i * addV;
for(i=0; i<N; ++i) {
  sumV += oneV / divV;
  divV += addV;
}
  2. Use omp parallel and omp critical to implement the reduction manually. This gives me more control, but the code is longer and more error-prone. The implementation below also assumes that the number of iterations is divisible by the number of threads.
double HarmonicSeries(const unsigned long long int N) {
  unsigned long long int i;
  Vec8d divV(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0);
  Vec8d sumV(0.0);
  const Vec8d addV(8.0);
  const Vec8d oneV(1.0);

/*
Manual reduction
This code blindly assumes that N is divisible by number of threads
https://stackoverflow.com/questions/18746282/openmp-schedulestatic-with-no-chunk-size-specified-chunk-size-and-order-of-as
*/
  unsigned long long int start, end;
  int thread_num, num_threads;
  Vec8d localsumV;
  Vec8d localdivV;

  #pragma omp parallel private( i, thread_num, num_threads, start, end, localsumV, localdivV )
  {
    thread_num = omp_get_thread_num();
    num_threads = omp_get_num_threads();
    start = thread_num * N / num_threads;
    end = (thread_num + 1) * N / num_threads;
    localsumV = sumV;
    localdivV = start * addV + divV;
    for(i=start; i<end; ++i) {
      localsumV += oneV / localdivV;
      localdivV += addV;
    }
    #pragma omp critical
    {
      sumV += localsumV;
    }
  }
  return horizontal_add(sumV);
}

I like the first version more.

Any hints on how to improve the first version so that the divisor is incremented in each iteration, instead of computed as divV = startdivV + i * addV;?

8 Comments

It's actually not that multiplication itself is slower than addition; same throughput and latency as addition on Zen4 (3c latency, 1c throughput for ZMM). vmulpd and vaddpd don't even compete for the same execution ports. (And the multiply+add can hopefully contract into one FMA.) It's that u64->fp conversion has a cost (but at least with AVX-512 it's still a single instruction). Plus a SIMD vector-int add, or a scalar broadcast of the loop counter. Perhaps with -ffast-math, a compiler might strength-reduce i*addV to asm equivalent to the divV += FP addition you had been using.
Do you need reduction(+ : divV) at all? Each thread needs its own version of it at some starting point, but it's just an induction variable, not a reduction; your code after the loop doesn't need its value. Combining the declare reduction stuff for Vec8d in general with just #pragma omp parallel for reduction(+ : sumV), GCC compiles it. I don't know if the resulting asm is actually correct, though; I didn't write a test caller. godbolt.org/z/xY15db1Wq . The asm has the expected insns, but with some store/reload so it might be terrible. And IDK about the init values.
Ok, I read some more about OpenMP, and apparently you do need to eliminate induction variables like divV because that would produce a race. (courses.engr.illinois.edu/cs420/fa2012/foils/OpenMPPart2.pdf) I guess the design intent is that the compiler can strength-reduce back to an induction variable if that's best?
Hi Peter, thanks a lot for your comments. You are right, reduction(+ : divV) is not needed at all. I have removed it. For speed - you are right, that slowdown is not due to the addition vs FMA, but due to the u64->fp conversion. I have measured runtime with 12 threads and I got {real 9s, user 1m46s} for manual reduction, and {real 9.6s; user 1m52s} for automatic reduction with divV = startdivV + i * addV; I was able to achieve the same speed for automatic reduction by doing divV = startdivV + i * addV; only in first round and then using divV += addV; I will post final solution below.
Please check the answer/code which I have just posted. Each thread will compute divV = startdivV + i * addV in the first iteration of the for loop (tracked by private variable bool first_loop for each thread) and only in subsequent iterations use divV += addV. This works since divV is private variable. In essence, I explicitly set the starting value of divV for each thread.