Parallel reduction using local memory in OpenCL

Question

I implemented a reduce kernel in OpenCL to sum up all entries in the input vector of size N. For a easier testing I initialize the input vector with 1.0f. So the result should be N. But it is not!

Here is my reduce-kernel:

kernel void reduce(global float* input, global float* output, const unsigned int N, local float* cache)
{
    const uint local_id = get_local_id(0);
    const uint global_id = get_global_id(0);
    const uint local_size = get_local_size(0);

    cache[local_id] = (global_id < N) ? input[global_id] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int s = local_size >> 1; s > 0; s >>= 1) {
        if (local_id < s) {
            cache[local_id] += cache[local_id + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (local_id == 0) output[local_size] = cache[0];
}

And here is the setting for OpenCL:

 const uint N = 8196;

 cl_float a[N];
 cl_float b[N];

 for (uint i=0; i<N; i++) {
      a[i] = 1.0f;
      b[i] = 0.0f;
 }

 cl::Buffer inputBuffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_float)*N);
 cl::Buffer resultBuffer(context, CL_MEM_READ_ONLY, sizeof(cl_float)*N);

 queue.enqueueWriteBuffer(inputBuffer, CL_TRUE, 0, sizeof(cl_float)*N, a);
 queue.enqueueWriteBuffer(resultBuffer, CL_TRUE, 0, sizeof(cl_float)*N, b);

 cl::Kernel addVectorKernel = cl::Kernel(program, "reduce");

 size_t localSize = addVectorKernel.getWorkGroupInfo<CL_KERNEL_WORK_GROUP_SIZE>(device); // e.g. => 512

 size_t globalSize = roundUp(localSize, N); // rounds up to a multiple of localSize

 addVectorKernel.setArg(0, inputBuffer);
 addVectorKernel.setArg(1, resultBuffer);
 addVectorKernel.setArg(2, N);
 addVectorKernel.setArg(3, (sizeof(cl_float) * localSize), NULL);


 queue.enqueueNDRangeKernel(
      addVectorKernel,
      cl::NullRange,    
      cl::NDRange(globalSize), 
      cl::NDRange(localSize)     
 );
 queue.finish(); // wait for ending

 queue.enqueueReadBuffer(resultBuffer, CL_TRUE, 0, sizeof(cl_float)*N, b); // e.g. => 1024

The result depends on the workgroup size. What am I doing wrong? Is it the kernel itself or is it the settings for OpenCL?

I think you meant to use arraySize =8192. This will be 32kb of local memory. — mfa
– mfa, Commented Feb 16, 2015 at 17:11

mfa · Accepted Answer · 2015-02-17 18:02:29Z

3

You should be using the group's id when writing the sum back to global memory.

if (local_id == 0) output[local_size] = cache[0];

That line will write to output[512] repeatedly. You need each work group to write to a dedicated location in the output.

kernel void reduce(global float* input, global float* output, const unsigned int N, local float* cache)
{
    const uint local_id = get_local_id(0);
    const uint global_id = get_global_id(0);
    const uint group_id = get_group_id(0);
    const uint local_size = get_local_size(0);

    cache[local_id] = (global_id < N) ? input[global_id] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int s = local_size >> 1; s > 0; s >>= 1) {
        if (local_id < s) {
            cache[local_id] += cache[local_id + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (local_id == 0) output[group_id] = cache[0];
}

Then you need to sum the values from the output on the host. Note that 'b' in the host code does not need to hold N elements. Only one element for each work group will be used.

//replace (globalSize/localSize) with the pre-calculated/known number of work groups
for (i=1; i<(globalSize/localSize); i++) {
    b[0] += b[i];
}

Now b[0] is your grand total.

edited Feb 17, 2015 at 18:02

answered Feb 17, 2015 at 17:46

mfa

5,0972 gold badges25 silver badges28 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Michael Dorner Over a year ago

Thank you. First of all, I think we should start with i=0 in the last mentioned look. Second, the issue sill remains...

mfa Over a year ago

i = 1 is correct.I am adding all other values to b[0]. I had a different mistake though: the body of the loop should read b[0] += b[i];

Lubo Antonov · Accepted Answer · 2015-02-17 09:35:28Z

2

In the reduction for loop, you need this:

for(unsigned int s = localSize >> 1; s > 0; s >>= 1)

You are shifting one more bit than you should when initializing s.

After that's fixed, let's look at what your kernel is doing. The host code executes it with globalSize of 8192 and localSize of 512, which results in 16 work groups. Inside the kernel you first sum the data from the two consecutive memory locations at index 2*global_id. For work group with id 15, work item 0, that will be at index 15*512*2 = 15,360 and 15,361, which is outside the boundaries of your input array. I am surprised you don't get a crash. At the same time, this explains why you have double the values that you expect.

To fix it, you can do this:

cache[localID] = input[globalID];

Or specify a global size that's half of the number of the current one.

edited Feb 17, 2015 at 9:35

answered Feb 16, 2015 at 21:41

Lubo Antonov

2,33214 silver badges18 bronze badges

2 Comments

Michael Dorner Over a year ago

Thanks for your post, but this was just a typo and does not problem the result itself - now it is just 1024 for a workgroup size of 512.

Michael Dorner Over a year ago

Thanks a lot, but I am sorry, it does not work anyway.

Collectives™ on Stack Overflow

Parallel reduction using local memory in OpenCL

2 Answers 2

2 Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related