Let me try to explain why sum[0] is overwritten rather than updated.
In your case of 16 work items, there are 16 threads which are running simultaneously. Now sum[0] is a single memory location which is shared by all of the threads, and the line sum[0] += input[idx] is run by each of the 16 threads, simultaneously.
Now the instruction sum[0] += input[idx] (I think) expands performs a read of sum[0], then adds input[idx] to that before writing the result back to sum[0].
There will will be a data race as multiple threads are reading from and writing to the same shared memory location. So what might happen is:
- All threads may read the value of
sum[0] before any other thread
writes their updated result back to sum[0], in which case the final
result of sum[0] would be the value of input[idx] of the thread
which executed the slowest. Since this will be different each time,
if you run the example multiple times you should see different
results.
- Or, one thread may execute slightly more slowly, in which case
another thread may have already written an updated result back to
sum[0] before this slow thread reads sum[0], in which case there
will be an addition using the values of more than one thread, but not
all threads.
So how can you avoid this?
Option 1 - Atomics (Worse Option):
You can use atomics to force all threads to block if another thread is performing an operation on the shared memory location, but this obviously results in a loss of performance since you are making the parallel process serial (and incurring the costs of parallelisation -- such as moving memory between the host and the device and creating the threads).
Option 2 - Reduction (Better Option):
The best solution would be to reduce the array, since you can use the parallelism most effectively, and can give O(log(N)) performance. Here is a good overview of reduction using OpenCL : Reduction Example.