I want to use the inclusive scan operation in OpenMP to implement an algorithm. What follows is a description of my attempt at doing so, and of how I failed to get more than a tepid speedup.
The inclusive scan operation is defined as follows: for an input vector, say [x1,x2,x3,x4], it outputs the sequence of partial sums [x1, x1+x2, x1+x2+x3, x1+x2+x3+x4]. This is a highly parallelizable operation, and presumably it has been implemented well in OpenMP.
I looked at the following reference: https://theartofhpc.com/pcse/omp-reduction.html#Scanprefixoperations (The manual reference https://www.openmp.org/spec-html/5.0/openmpsu45.html#x68-1940002.9.6 seems too cryptic to me right now)
The artofhpc website says that the reduction clause gets an inscan modifier:
#pragma omp parallel for reduction(inscan,+:sumvar)
In the body of the parallel loop there is a scan directive that allows you to store the partial results. For inclusive scans the reduction variable is updated before the scan pragma:
sumvar += x[i];                     // update the reduction variable
#pragma omp scan inclusive(sumvar)
partials[i] = sumvar;               // store the partial result
I followed the same syntax and measured the performance against the standard serial scan; the results were quite disappointing. My full code is at the bottom of the post.
In the code, I run a simple test: I take a very large vector of 90 million random values in the interval [-1,1], scan it with an increasing number of threads, and measure the speedup over the serial scan. Here are my results (they are the same on repeated runs). My laptop CPU has 16 hardware threads, yet the overall speedup is a disappointing 1.36. (I would have expected something substantially more!)
Is there something wrong in the way I am using the OpenMP syntax for reduction?
➜ Desktop gcc -fopenmp scantest.c && ./a.out
NumThreads Speedup
1 0.458
2 1.173
3 1.424
4 1.686
5 1.635
6 1.501
7 1.522
8 1.499
9 1.455
10 1.416
11 1.395
12 1.393
13 1.352
14 1.336
15 1.353
16 1.357
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <omp.h>

int main(int argc, char** argv){
    int N = 9e7;            // vector size
    double* x;              // input vector
    double* partials_s;     // serial scan output
    double* partials_p;     // parallel scan output
    double end, start;      // timing variables
    double sumvar;
    int tmax = argc > 1 ? atoi(argv[1]) : 35;
    int threadcount;

    // Allocate space for all vectors
    x          = (double*) malloc(sizeof(double)*N);
    partials_s = (double*) malloc(sizeof(double)*N);
    partials_p = (double*) malloc(sizeof(double)*N);

    // Populate the input vector; zero the outputs
    for(int i=0 ; i<N ; ++i){
        x[i] = -1 + 2.0*rand()/(double)RAND_MAX;
        partials_s[i] = 0.0;
        partials_p[i] = 0.0;
    }

    //-----------------------------------------
    // Serial inclusive scan
    //-----------------------------------------
    start = omp_get_wtime();
    sumvar = 0.0;
    for(int i=0 ; i<N ; ++i){
        sumvar += x[i];
        partials_s[i] = sumvar;
    }
    end = omp_get_wtime();
    double stime = end - start;   // time to run the serial code

    //-----------------------------------------------------------
    // Parallel inclusive scan. Print ratio of serial/parallel time
    //-----------------------------------------------------------
    printf("\nNumThreads Speedup \n");
    for(threadcount=1 ; threadcount<=tmax ; ++threadcount){
        start = omp_get_wtime();
        sumvar = 0.0;
        #pragma omp parallel for num_threads(threadcount) reduction(inscan,+:sumvar)
        for(int i=0 ; i<N ; ++i){
            sumvar = sumvar + x[i];     // update sumvar before the scan point
            #pragma omp scan inclusive(sumvar)
            partials_p[i] = sumvar;     // store the inclusive prefix
        }
        end = omp_get_wtime();
        double ptime = end - start;
        printf("%d \t %.3f\n", threadcount, stime/ptime);
    }

    //for(int i=0 ; i<N ; ++i){
    //    printf("%.4f %.4f\n", partials_s[i], partials_p[i]);
    //}

    // Deallocate
    free(x);
    free(partials_s);
    free(partials_p);
    return 0;
}