I want to use the inclusive scan operation in OpenMP to implement an algorithm. What follows is a description of my attempt at doing so, and of how I failed to get more than a tepid speedup.
The inclusive scan operation is defined as follows: for an input vector, say [x1,x2,x3,x4], it outputs the sequence of partial sums [x1, x1+x2, x1+x2+x3, x1+x2+x3+x4]. This is a highly parallelizable operation, and presumably it has been implemented well in OpenMP.
I looked at the following reference: https://theartofhpc.com/pcse/omp-reduction.html#Scanprefixoperations (The manual reference https://www.openmp.org/spec-html/5.0/openmpsu45.html#x68-1940002.9.6 seems too cryptic to me right now)
The artofhpc website says that the reduction clause gets an inscan modifier:
#pragma omp parallel for reduction(inscan,+:sumvar)
In the body of the parallel loop there is a scan directive that allows you to store the partial results. For inclusive scans the reduction variable is updated before the scan pragma:
sumvar += x[i];                     // update the reduction variable
#pragma omp scan inclusive(sumvar)
partials[i] = sumvar;               // store the partial result
I followed the same syntax and measured the performance against the standard serial scan; the results were quite disappointing. My full code is at the bottom of the post.
In the code, I run a simple test: I take a very large vector of 90 million random values in the interval [-1,1], scan it with an increasing number of threads, and measure the speedup over the serial scan. Here are my results (they are the same on repeated runs). My laptop CPU has 16 hardware threads, yet the overall speedup is a disappointing 1.36. (I would have expected something substantially more!)
Is there something wrong in the way I am using the OpenMP syntax for reduction?
➜ Desktop gcc -fopenmp scantest.c && ./a.out
NumThreads Speedup
1 0.458
2 1.173
3 1.424
4 1.686
5 1.635
6 1.501
7 1.522
8 1.499
9 1.455
10 1.416
11 1.395
12 1.393
13 1.352
14 1.336
15 1.353
16 1.357
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <omp.h>

int main(int argc, char** argv){
    int N = 9e7;            // vector size
    double* x;              // input vector
    double* partials_s;     // serial scan output
    double* partials_p;     // parallel scan output
    double end, start;      // timing variables
    double sumvar;
    int tmax = argc > 1 ? atoi(argv[1]) : 35;
    int threadcount;

    // Allocate space for all vectors
    x          = (double*) malloc(sizeof(double)*N);
    partials_s = (double*) malloc(sizeof(double)*N);
    partials_p = (double*) malloc(sizeof(double)*N);

    // Populate the input vector; zero the outputs
    for(int i=0 ; i<N ; ++i){
        x[i] = -1 + 2.0*rand()/(double)RAND_MAX;
        partials_s[i] = 0.0;
        partials_p[i] = 0.0;
    }

    //-----------------------------------------
    // Serial inclusive scan
    //-----------------------------------------
    start = omp_get_wtime();
    sumvar = 0.0;
    for(int i=0 ; i<N ; ++i){
        sumvar += x[i];
        partials_s[i] = sumvar;
    }
    end = omp_get_wtime();
    double stime = end - start;   // time to run the serial code

    //-----------------------------------------------------------
    // Parallel inclusive scan. Print ratio of serial/parallel time
    //-----------------------------------------------------------
    printf("\nNumThreads Speedup \n");
    for(threadcount=1 ; threadcount<=tmax ; ++threadcount){
        start = omp_get_wtime();
        sumvar = 0.0;
        #pragma omp parallel for num_threads(threadcount) reduction(inscan,+:sumvar)
        for(int i=0 ; i<N ; ++i){
            sumvar = sumvar + x[i];     // update sumvar before the scan point
            #pragma omp scan inclusive(sumvar)
            partials_p[i] = sumvar;     // store the inclusive prefix
        }
        end = omp_get_wtime();
        double ptime = end - start;
        printf("%d \t %.3f\n", threadcount, stime/ptime);
    }

    //for(int i=0 ; i<N ; ++i){
    //    printf("%.4f %.4f\n", partials_s[i], partials_p[i]);
    //}

    // Deallocate
    free(x);
    free(partials_s);
    free(partials_p);
    return 0;
}