
I am trying to implement the Hausdorff distance in OpenCL, and the following kernel forms the basis for it, or so I presume, as I still have to implement it completely. That said, can I get some suggestions, or is there a way to optimize this kernel? Specifically, how can I remove the for loop in the kernel function that calls the helper function?

OpenCL Kernel and its helper function:

void helper( int a_1, __global int* b_1, __global int* c_1 ){
        int i = get_global_id(0);
        c_1[i] = a_1 - b_1[i];
}

__kernel void test_call( __global int* a,             //input buffer of size [100000, 1]
                         __global int* b,             //input buffer of size [100000, 1]
                         __global int* c ){           //output buffer of size [100000, 1]
        for ( int iter = 0 ; iter < 100000 ; iter++ ){
                helper ( a[iter], b, c );
                // once array c is obtained by calling the above function,
                // it will be used in further processing that will take place inside
                // this for loop itself
        }
}

Essentially, what I am trying to do here is subtract each element in the input buffer 'a' from each element in the input buffer 'b'. The complexity is O(n²).

By the way, this naive implementation itself produces results within 2.5 seconds. A serial implementation takes a few minutes to complete.

    I suggest removing the call to helper and making it inline. Additionally, use a pragma to unroll the for loop so the GPU can exploit ILP more effectively. Commented Oct 23, 2012 at 21:10

1 Answer


I suspect your code can still be improved by using a tiling scheme. In your current implementation, all work-items load all values of buffer 'a', and at the moment they do so in an unsynchronized fashion. With a tiling scheme you can probably make better use of your cache architecture by having the device load each value of 'a' from off-chip memory only once.

This scheme is best explained here: http://software.intel.com/file/24571 (PDF file).

In your case, the tiled parallel approach could look like the following pseudo-code:

forall values of b in parallel {
    foreach tile q {
        forall work_items p in work_group in parallel {
            local[p] = a[q*tile_size + p]
        }
        synchronize work items in work group
        foreach element j in tile q {
            c[i] = local[j] - b[i];
        }
        synchronize work items in work group
    }
}

The key idea is that every work_item can make use of the values of buffer 'a' that have already been loaded by other work_items in the same group. Every entry of 'a' would then (ideally) only get fetched once from memory and tile_size-1 times from a cache.


2 Comments

Using the cache memory will improve the execution time (on a GPU, at least). You could also consider treating your input data as vectors to exploit SIMD instructions (not every OpenCL compiler will vectorize your kernel automatically).
@user1083498, I combined WarfarA's suggestion with tiling. I'm getting faster execution times than before. Thank you.
