
I am using OpenCL (Xcode, Intel GPU), and I am trying to implement a kernel that calculates moving averages and deviations. I want to pass several double arrays of varying lengths to the kernel. Is this possible to implement, or do I need to pad smaller arrays with zeroes so all the arrays are the same size?

I am new to OpenCL and GPGPU, so please forgive my ignorance of any nomenclature.

1 Answer


You can pass any buffer to the kernel; the kernel does not need to use all of it. For example, if your kernel reduces a buffer, you can query the number of work items (the number of elements to reduce) at run time with get_global_size(0), and then call the kernel with the proper parameters.

An example (unoptimized):

__kernel void reduce_step(__global float* data)
{
    int id = get_global_id(0);
    int size = get_global_size(0);
    int size2 = size/2;
    int size2p = (size+1)/2;
    if(id<size2) //Only reduce up to size2, the odd element will remain in place
       data[id] += data[id+size2p];
}

Then you can call it like this:

void reduce_me(std::vector<cl_float>& data){
    size_t size = data.size();

    //Copy to a buffer already created, equal or bigger size than data size
    // ... TODO, check sizes of buffer or change the buffer set to the kernel args.
    queue.enqueueWriteBuffer(buffer,CL_FALSE,0,sizeof(cl_float)*size,data.data());

    //Reduce on the GPU until only 1024 elements remain
    while(size > 1024){
        queue.enqueueNDRangeKernel(reduce_kernel,cl::NullRange,cl::NDRange(size),cl::NullRange);
        size = (size+1)/2; //Matches size2p in the kernel: the odd element survives
    }

    //Read out and trim
    queue.enqueueReadBuffer(buffer,CL_TRUE,0,sizeof(cl_float)*size,data.data());
    data.resize(size);
}

3 Comments

so I can't really tell easily from the Khronos specs, but does get_global_size(n) get element n from the globalWorkSize[] array? And how do you use the get_global_size() function to accomplish different things? Any tutorials or examples you know of would help. My ideal situation would be to pass a large constant array of arrays, each of variable length. I had the idea to standardize the arrays' lengths by padding with zeroes, fold them into one extremely large array, and then use the modulus operator in the kernel to derive where one array ends and the next begins. But this seems expensive in memory? Thanks
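Instead of zero-padding plus a modulus, a common layout is to concatenate the arrays and pass a second buffer of offsets (a running sum of the lengths); the kernel then reads element i of array a as flat[offsets[a] + i]. A hypothetical host-side sketch of building that layout (the names flatten, flat, and offsets are illustrative, not from the answer):

```cpp
#include <vector>
#include <cstddef>

// Concatenate variable-length arrays and record where each one starts.
// offsets gets one extra entry, so the length of array a is
// offsets[a + 1] - offsets[a] and no padding is wasted.
void flatten(const std::vector<std::vector<float>>& arrays,
             std::vector<float>& flat,
             std::vector<size_t>& offsets)
{
    flat.clear();
    offsets.clear();
    size_t pos = 0;
    for (const auto& a : arrays) {
        offsets.push_back(pos);
        flat.insert(flat.end(), a.begin(), a.end());
        pos += a.size();
    }
    offsets.push_back(pos); // total length, closes the last array
}
```

Both vectors would then be copied to device buffers with enqueueWriteBuffer, just like the data buffer in the answer.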
You don't need to pass all the data in a single array; you can separate it into different arrays if needed. If you need to perform different operations on different arrays, then separate them into different kernels. Passing 20 buffers with 20 different lengths to do 20 different processes in a single kernel is not only hard to do but also suboptimal, since you would need many code paths to achieve it. What I depict in my code is an example kernel that applies one process to one buffer; the size is known at execution time and there is no branching.
Well, I am doing moving averages/deviations for 5 different time periods, over probably about a billion data points. I was trying to come up with some clever way to avoid having to call the kernel for every data point (since the averages are based on previous data points), as I am sure that has a decent overhead cost. Like I said, my thought was to fold a number of examples into the different time-period arrays and then execute with integer division. I will test performance to see. I was also thinking I could do something with work groups/dimensions/whatever, but I'm not yet educated enough.
