I am working on an implementation of the Hillis-Steele scan in OpenMP.
OpenMP does not provide a model of parallel computation appropriate for the algorithms described in the Hillis & Steele paper you reference. From their introductory comments:
In this article we describe a series of algorithms appropriate for
fine-grained parallel computers with general communications. We call
these algorithms data parallel algorithms because their parallelism
comes from simultaneous operations across large sets of data, rather
than from multiple threads of control. The intent is not so much to
present new algorithms (most have been described earlier in other
contexts), but rather to demonstrate a style of programming that is
appropriate for a machine with tens of thousands or even millions of
processors. The algorithms described all use O(N) processors to solve
problems of size N, typically in O(log N) time.
I am confident in saying that you do not have a machine with the physical and computational characteristics they go on to describe. And although OpenMP can make some use of the host machine's SIMD capabilities, it fundamentally provides a multithreading approach to parallelism, not the "data parallel" approach on which the algorithms described in the paper are based.
Possibly you could simulate the needed characteristics via OpenMP or another thread-based approach, but you would need to add fairly fine-grained synchronization, which I anticipate would impact performance enough to make the algorithm uninteresting, except possibly for ginormous data sets.
my function is outputting the incorrect results
This is unsurprising, because it is full of data races. This is where the need for synchronization that I mentioned above comes in. You can reduce that need by introducing an additional array, the same size as the original one. At each iteration of the outer loop, the inner loop reads from one and writes to the other, then on the next iteration you flip. That reduces the need for synchronization to once per outer loop iteration, which OpenMP will provide for you automatically.
Here's a variation that works:
#include <stddef.h>   // size_t
#include <string.h>   // memcpy

void prefixSumShared(double *arr, double *scratch, size_t size, int numThreads) {
    if (size < 2) {
        // nothing to do
        return;
    }
    double *src = arr;
    double *dest = scratch;
    size_t step;

    dest[0] = src[0];
    for (step = 1; step < size; step <<= 1) {
        // Because we're flipping buffers back and forth, we need to copy
        // elements that were updated by the previous iteration, but otherwise
        // would not be by this iteration:
        memcpy(dest + (step >> 1), src + (step >> 1), (step >> 1) * sizeof *arr);
        #pragma omp parallel for num_threads(numThreads)
        for (size_t k = step; k < size; k++) {
            dest[k] = src[k] + src[k - step];
        }
        // OMP provides an implicit barrier here
        scratch = src;
        src = dest;
        dest = scratch;
    }
    if (src != arr) {
        // Copy elements updated in the last iteration
        step >>= 1;
        memcpy(arr + step, src + step, (size - step) * sizeof *arr);
    }
}
It relies on the caller to provide appropriate scratch space, at least equal in size to the input array. It uses the buffer-flipping approach I described, which makes it a bit more complex than it might be, but saves some copying. An alternative would be to always sum from arr into scratch, and then copy back the changes on each iteration, but that would involve asymptotically more copying (O(n log n) instead of O(n)).
I have also removed the do-nothing iterations of the inner loop. They make sense for the computing environment assumed by Hillis & Steele, but not for OpenMP. And I have restructured a bit to remove the need for FP operations.
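For comparison, the copy-back alternative mentioned above might look like the sketch below. The name prefixSumCopyBack is hypothetical, and this is an illustration of the trade-off rather than a recommended replacement.

```c
#include <stddef.h>
#include <string.h>

// Sketch of the alternative: always read from arr and write to scratch,
// then copy the updated suffix back into arr before the next stride.
// The per-iteration copy is what makes the total copying O(n log n).
void prefixSumCopyBack(double *arr, double *scratch, size_t size, int numThreads) {
    for (size_t step = 1; step < size; step <<= 1) {
        #pragma omp parallel for num_threads(numThreads)
        for (size_t k = step; k < size; k++) {
            scratch[k] = arr[k] + arr[k - step];
        }
        // implicit barrier at the end of the parallel for
        memcpy(arr + step, scratch + step, (size - step) * sizeof *arr);
    }
}
```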
1. This algorithm does not work in-place on arr; the scan operation needs a separate output buffer. 2. I don't understand your k > power2 test. If you let that loop go with strides of 2*power it's more likely to work. Even in place.