OpenCL data parallel summation into a variable

Question

Is it possible to use the opencl data parallel kernel to sum vector of size N, without doing the partial sum trick?

Say that if you have access to 16 work items and your vector is of size 16. Wouldn't it not be possible to just have a kernel doing the following

    __kernel void summation(__global float* input, __global float* sum)
{
    int idx = get_global_id(0);

    sum[0] += input[idx];
}

When I've tried this, the sum variable doesn't get updated, but only overwritten. I've read something about using barriers, and i tried inserting a barrier before the summation above, it does update the variable somehow, but it doesn't reproduce the correct sum.

Overwritten but not updated ? What does that even mean to you ? By the way this kind of operation is exactly what cannot be done in parallel, what if sum[0] get read by one work item just before being written by another ? the first change will just go to trash — Archie
– Archie, Commented Jan 29, 2016 at 8:17
@HubertApplebaum Thanks for the tip! Yeah I know that for a vector of size 16 it seems to not be worth the trouble to use OpenCL. I was just curious if the concept would work. And I have nothing against reductions, just thought that the coding would be less and easier to read? — Batko
– Batko, Commented Jan 29, 2016 at 8:31
@Guiroux: Sorry if I was a bit unclear. What I meant with overwritten is what you refer to as the first change will just go to trash. I guess that the reduction is the safest and fastest way to do it, and to ignore this way of thinking? Thank you for your comments! — Batko
– Batko, Commented Jan 29, 2016 at 8:35

RobClucas · Accepted Answer · 2016-01-29 08:58:05Z

3

Let me try to explain why sum[0] is overwritten rather than updated.

In your case of 16 work items, there are 16 threads which are running simultaneously. Now sum[0] is a single memory location which is shared by all of the threads, and the line sum[0] += input[idx] is run by each of the 16 threads, simultaneously.

Now the instruction sum[0] += input[idx] (I think) expands performs a read of sum[0], then adds input[idx] to that before writing the result back to sum[0].

There will will be a data race as multiple threads are reading from and writing to the same shared memory location. So what might happen is:

All threads may read the value of sum[0] before any other thread writes their updated result back to sum[0], in which case the final result of sum[0] would be the value of input[idx] of the thread which executed the slowest. Since this will be different each time, if you run the example multiple times you should see different results.
Or, one thread may execute slightly more slowly, in which case another thread may have already written an updated result back to sum[0] before this slow thread reads sum[0], in which case there will be an addition using the values of more than one thread, but not all threads.

So how can you avoid this?

Option 1 - Atomics (Worse Option):

You can use atomics to force all threads to block if another thread is performing an operation on the shared memory location, but this obviously results in a loss of performance since you are making the parallel process serial (and incurring the costs of parallelisation -- such as moving memory between the host and the device and creating the threads).

Option 2 - Reduction (Better Option):

The best solution would be to reduce the array, since you can use the parallelism most effectively, and can give O(log(N)) performance. Here is a good overview of reduction using OpenCL : Reduction Example.

edited Jan 29, 2016 at 8:58

answered Jan 29, 2016 at 8:46

RobClucas

8557 silver badges16 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Batko Over a year ago

I suspected it might be the issue you just described. Thanks!

RobClucas Over a year ago

No problem, glad that helped!

huseyin tugrul buyukisik · Accepted Answer · 2016-01-29 21:28:48Z

0

Option 3 (and worst of all)

    __kernel void summation(__global float* input, __global float* sum)
{
    int idx = get_global_id(0);
    for(int j=0;j<N;j++)
    {
        barrier(CLK_GLOBAL_MEM_FENCE| CLK_LOCAL_MEM_FENCE);
        if(idx==j)
         sum[0] += input[idx];
        else
         doOtherWorkWhileSingleCoreSums();

    }
}

using a mainstream gpu, this should sum all of them as slow as a pentium mmx . This is just like computing on a single core and giving other cores other jobs but in a slower way.

A cpu device could be better than gpu for this kind.

answered Jan 29, 2016 at 21:28

huseyin tugrul buyukisik

12k6 gold badges53 silver badges109 bronze badges

Collectives™ on Stack Overflow

OpenCL data parallel summation into a variable

2 Answers 2

2 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related