I am new to OpenCL, but I understand the basics of C/C++ and OOP. My question is: is it possible, at least in theory, to compute a sum in parallel? Below I describe what I have tried.

The task is, for example:

double* values = new double[1000]; //let's pretend it has some random values inside
double sum = 0.0;

for(int i = 0; i < 1000; i++) {
    sum += values[i];
}

What I tried to do in OpenCL kernel (and I feel it is wrong because perhaps it accesses the same "sum" variable from different threads/tasks at the same time):

__kernel void calculate2dim(__global float* vectors1dim,
                            __global float output,
                            const unsigned int count) {
    int i = get_global_id(0);
    output += vectors1dim[i];
}

This code is wrong. I would greatly appreciate it if anyone could tell me whether such a task can theoretically be run in parallel and, if so, how!

  • That's a classic reduction problem. Look here for a step-by-step explanation of optimizing this process for many-core architectures (it's CUDA, but the principles are exactly the same, except maybe for the part about templates). Some more introductory material on the topic might be more helpful, but I leave that to a proper answer. Commented Mar 4, 2013 at 11:43
  • Thank you very much! Now I know it is a common problem and will learn how to solve it! Commented Mar 4, 2013 at 12:28

2 Answers

If you want to sum the values of your array in parallel, you should reduce contention and make sure there are no data dependencies across threads.

Data dependencies force threads to wait for each other, creating contention, which is what you want to avoid to get true parallelization.

One way to do that is to split your array into N arrays, each containing a subsection of the original array, and then call your OpenCL kernel function once for each of them.

At the end, when all kernels have done the hard work, you can just sum up the results of each array into one. This operation can easily be done by the CPU.
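The split-then-combine idea can be sketched on the CPU in plain C++ (not OpenCL). The name `chunked_sum` and the `num_chunks` parameter are made up for this illustration, and each pass of the outer loop merely stands in for one independent kernel invocation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sums `values` by computing one partial sum per chunk.
// Each chunk reads only its own slice and writes only its own partial result,
// so the chunks have no dependencies on each other and could run in parallel.
double chunked_sum(const std::vector<double>& values, std::size_t num_chunks) {
    assert(num_chunks > 0);
    std::vector<double> partial(num_chunks, 0.0);
    // Round up so that num_chunks chunks always cover the whole array.
    const std::size_t chunk_len = (values.size() + num_chunks - 1) / num_chunks;

    for (std::size_t c = 0; c < num_chunks; ++c) {   // independent work per chunk
        const std::size_t begin = std::min(c * chunk_len, values.size());
        const std::size_t end = std::min(begin + chunk_len, values.size());
        for (std::size_t i = begin; i < end; ++i)
            partial[c] += values[i];
    }

    double total = 0.0;                              // cheap final step on the CPU
    for (double p : partial)
        total += p;
    return total;
}
```

Note that floating-point addition is not associative, so for general data the chunked result can differ from a plain left-to-right loop in the last bits.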

The key is to not have any dependencies between the calculations done in each kernel, so you have to split your data and processing accordingly.

I can't tell from your question whether your data has any actual dependencies; that is for you to figure out.

2 Comments

Thank you for the answer. I should probably split my array into several separate ones! Do you know any way to pass a two-dimensional array (like double[][]) to the kernel? A pointer-to-pointer cannot be used as a kernel argument.
You do not need to pass a two-dimensional array. Just pass your buffer and address it like this: myarr[y*WIDTH + x]. That is all.
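As a rough host-side illustration of that indexing, here is a small C++ sketch. The helpers `flat_index` and `flatten` are hypothetical names, and row-major order with a fixed row width is assumed:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: maps a 2-D coordinate to an index in a flat,
// row-major buffer, i.e. the y*WIDTH + x rule from the comment above.
std::size_t flat_index(std::size_t x, std::size_t y, std::size_t width) {
    assert(x < width);
    return y * width + x;
}

// Copies a table of equally long rows into one contiguous buffer,
// which is the layout a single buffer argument can carry.
std::vector<double> flatten(const std::vector<std::vector<double>>& table,
                            std::size_t width) {
    std::vector<double> flat(table.size() * width);
    for (std::size_t y = 0; y < table.size(); ++y) {
        assert(table[y].size() == width);   // every row must have `width` columns
        for (std::size_t x = 0; x < width; ++x)
            flat[flat_index(x, y, width)] = table[y][x];
    }
    return flat;
}
```

The kernel then only needs the flat pointer plus the width to find any element.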

The piece of code I've provided for reference should do the job.

Say you have N elements and your work-group size is WS = 64. I assume that N is a multiple of 2*WS (this is important: one work-group calculates the sum of 2*WS elements). Then you need to run the kernel with:

globalSizeX = WS*(N/(2*WS));   // N/2 work-items: each one loads two elements

As a result, the sum array will hold partial sums of 2*WS elements each (e.g. sum[1] will contain the sum of the elements with indices 2*WS to 4*WS-1).

If N is 2*WS (which means that you have only one work-group), then you are done: just use sum[0] as the result. If not, repeat the procedure, this time using the sum array as the input array and writing to another array (create 2 arrays and ping-pong between them), and so on until only one work-group is left.
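The repeated passes might look like this on the host side. This is a CPU-only C++ sketch: `reduce_pass` is a hypothetical stand-in for one kernel launch, and padding with zeros between passes is an assumption of the sketch to keep every pass a multiple of 2*WS:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for one kernel launch: "work-group" g sums the 2*ws consecutive
// inputs starting at index 2*ws*g and writes one partial sum.
std::vector<unsigned> reduce_pass(const std::vector<unsigned>& in, std::size_t ws) {
    assert(ws > 0 && in.size() % (2 * ws) == 0);
    std::vector<unsigned> out(in.size() / (2 * ws), 0u);
    for (std::size_t g = 0; g < out.size(); ++g)
        for (std::size_t k = 0; k < 2 * ws; ++k)
            out[g] += in[2 * ws * g + k];
    return out;
}

// Repeats passes, swapping the two arrays, until a single value is left.
unsigned reduce_all(std::vector<unsigned> a, std::size_t ws) {
    assert(ws > 0);
    if (a.empty()) return 0u;
    std::vector<unsigned> b;
    do {
        // Assumption of this sketch: pad with zeros so the size is a multiple of 2*ws.
        while (a.size() % (2 * ws) != 0) a.push_back(0u);
        b = reduce_pass(a, ws);
        std::swap(a, b);   // ping-pong: this pass's output is the next pass's input
    } while (a.size() > 1);
    return a[0];
}
```

With WS = 64, an input of 1000 elements is padded to 1024, reduced to 8 partial sums, padded again to 128, and reduced to the final value in the second pass.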

Search also for the Hillis-Steele and Blelloch parallel scan algorithms. This article could be useful as well.
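For orientation, here is a sequential C++ sketch of the Hillis-Steele inclusive scan (prefix sum). It is not the kernel from this answer, and the function name is chosen for illustration. In each step every element adds the value `offset` positions to its left, so all elements of a step could be updated in parallel:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hillis-Steele inclusive scan, emulated sequentially.
// Two buffers are used so a step never reads values written in the same step.
std::vector<unsigned> hillis_steele_scan(std::vector<unsigned> data) {
    std::vector<unsigned> next(data.size());
    for (std::size_t offset = 1; offset < data.size(); offset *= 2) {
        for (std::size_t i = 0; i < data.size(); ++i)   // each i is independent
            next[i] = (i >= offset) ? data[i] + data[i - offset] : data[i];
        std::swap(data, next);
    }
    return data;   // data[i] now holds the sum of the first i+1 inputs
}
```

The last element of the result is the total sum, so a scan also answers the reduction question, although it does more additions than a plain tree reduction.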

Here is the actual example:

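// BLK_SIZ must equal the work-group size and be a power of two. It is assumed
// to be defined at build time, e.g. with the compiler option -DBLK_SIZ=64.
__kernel void par_sum(__global const unsigned int* input, __global unsigned int* sum)
{
    int li = get_local_id(0);
    int groupId = get_group_id(0);

    // Local arrays need a compile-time size; each group handles 2*BLK_SIZ elements.
    __local unsigned int our_h[2 * BLK_SIZ];
    our_h[2*li + 0] = input[2*BLK_SIZ*groupId + 2*li + 0];
    our_h[2*li + 1] = input[2*BLK_SIZ*groupId + 2*li + 1];

    // sweep up: pairwise sums with a stride that doubles every step
    int width = 2;
    int num_el = 2*BLK_SIZ/width;   // number of active work-items in this step
    int wby2 = width>>1;

    for(int i = 2*BLK_SIZ>>1; i>0; i>>=1)
    {
        // wait until the previous step's writes are visible to the whole group
        barrier(CLK_LOCAL_MEM_FENCE);

        if(li < num_el)
        {
            int idx = width*(li+1) - 1;
            our_h[idx] = our_h[idx] + our_h[(idx - wby2)];
        }

        width<<=1;
        wby2 = width>>1;
        num_el>>=1;
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    // the last element now holds the group's total; one work-item saves it
    if(0 == li)
        sum[groupId] = our_h[2*BLK_SIZ-1]; // save sum
}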
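To see which elements the up-sweep touches, the loop can be emulated sequentially for a single work-group. This is a CPU-only C++ sketch with a hypothetical function name; the vector `h` plays the role of the local array, and its size is assumed to be a power of two:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential emulation of the kernel's up-sweep for ONE work-group.
unsigned upsweep_group_sum(std::vector<unsigned> h) {
    // Size must be a power of two and at least 2 (two elements per work-item).
    assert(h.size() >= 2 && (h.size() & (h.size() - 1)) == 0);
    std::size_t num_el = h.size() / 2;                  // active work-items this step
    for (std::size_t width = 2; width <= h.size(); width *= 2) {
        for (std::size_t li = 0; li < num_el; ++li) {   // one iteration per work-item
            const std::size_t idx = width * (li + 1) - 1;
            h[idx] += h[idx - width / 2];               // add the left partner's subtotal
        }
        num_el /= 2;                                    // half as many pairs next step
    }
    return h.back();                                    // last slot holds the group total
}
```

For the input 1..8 the steps combine strides of 1, 2 and 4 elements: the array becomes {1,3,3,7,5,11,7,15}, then slots 3 and 7 become 10 and 26, and finally slot 7 becomes 36. Within one step no slot is both read and written, which is why a single barrier per step is enough on the device.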
