I am working on an implementation of the Hillis-Steele scan in OpenMP.
OpenMP does not provide a model of parallel computation appropriate for the algorithms described in the Hillis & Steele paper you reference. From their introductory comments:
In this article we describe a series of algorithms appropriate for
fine-grained parallel computers with general communications. We call
these algorithms data parallel algorithms because their parallelism
comes from simultaneous operations across large sets of data, rather
than from multiple threads of control. The intent is not so much to
present new algorithms (most have been described earlier in other
contexts), but rather to demonstrate a style of programming that is
appropriate for a machine with tens of thousands or even millions of
processors. The algorithms described all use O(N) processors to solve
problems of size N, typically in O(log N) time.
I am confident in saying that you do not have a machine with the physical and computational characteristics they go on to describe. And although OpenMP can make some use of the host machine's SIMD capabilities, it fundamentally provides a multithreading approach to parallelism, not the "data parallel" approach on which the algorithms described in the paper are based.
Possibly you could simulate the needed characteristics via OpenMP or another thread-based approach, but you would need to add fairly fine-grained synchronization, which I anticipate would impact performance enough to make the algorithm uninteresting, except possibly for ginormous data sets.
my function is outputting the incorrect results
This is unsurprising, because it is full of data races. This is where the need for synchronization that I mentioned above comes in. You can reduce that need by introducing an additional array, the same size as the original one. At each iteration of the outer loop, the inner loop reads from one and writes to the other, then on the next iteration you flip. That reduces the need for synchronization to once per outer loop iteration, which OpenMP will provide for you automatically.
Here's a variation that works:
#include <stddef.h>   // size_t
#include <string.h>   // memcpy

void prefixSumShared(double *arr, double *scratch, size_t size, int numThreads) {
    if (size < 2) {
        // nothing to do
        return;
    }
    double *src = arr;
    double *dest = scratch;
    size_t step;

    dest[0] = src[0];
    for (step = 1; step < size; step <<= 1) {
        // Because we're flipping buffers back and forth, we need to copy
        // elements that were updated by the previous iteration, but otherwise
        // would not be by this iteration:
        memcpy(dest + (step >> 1), src + (step >> 1), (step >> 1) * sizeof *arr);
        #pragma omp parallel for num_threads(numThreads)
        for (size_t k = step; k < size; k++) {
            dest[k] = src[k] + src[k - step];
        }
        // OMP provides an implicit barrier here
        scratch = src;
        src = dest;
        dest = scratch;
    }
    if (src != arr) {
        // Copy elements updated in the last iteration
        step >>= 1;
        memcpy(arr + step, src + step, (size - step) * sizeof *arr);
    }
}
It relies on the caller to provide appropriate scratch space, at least equal in size to the input array. It uses the buffer-flipping approach I described, which makes it a bit more complex than it might be, but saves some copying. An alternative would be to always sum from arr into scratch, and then copy back the changes on each iteration, but that would involve asymptotically more copying (O(n log n) instead of O(n)).
I have also removed the do-nothing iterations of the inner loop. They make sense for the computing environment assumed by Hillis & Steele, but not for OpenMP. And I have restructured a bit to remove the need for FP operations.
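For comparison, the copy-back alternative mentioned above might look like the sketch below. The name prefixSumCopyBack is hypothetical, and this is an illustration of the trade-off rather than a recommended replacement.

```c
#include <stddef.h>
#include <string.h>

// Sketch of the alternative: always read from arr and write to scratch,
// then copy the updated suffix back into arr before the next stride.
// The per-iteration copy is what makes the total copying O(n log n).
void prefixSumCopyBack(double *arr, double *scratch, size_t size, int numThreads) {
    for (size_t step = 1; step < size; step <<= 1) {
        #pragma omp parallel for num_threads(numThreads)
        for (size_t k = step; k < size; k++) {
            scratch[k] = arr[k] + arr[k - step];
        }
        // implicit barrier at the end of the parallel for
        memcpy(arr + step, scratch + step, (size - step) * sizeof *arr);
    }
}
```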
1. This algorithm does not work in-place on arr; the scan operation needs a separate output buffer. 2. I don't understand your k > power2 test. If you let that loop go with strides of 2*power it's more likely to work. Even in place.