I need to write an OpenCL program for reducing a large buffer (several million floats) into a single float. For the simplicity of the question I will suppose here that I need to compute the sum of all floats.
So I have written a kernel which takes a float buffer as input, and sums it by packets of 64. It writes the result to a buffer which is 64 times smaller. I then iterate the call of this kernel until the data is small enough to be copied back on the host and summed by the CPU.
I'm new to OpenCL, do I need to have a barrier between each kernel so that they are run sequentially, or is OpenCL smart enough to detect that the nth kernel pass is writing to an output buffer used as the input buffer of the n+1th kernel?
Or is there a smarter approach?