
I am trying to implement the Hausdorff distance in OpenCL, and the following kernel forms the basis for it, or so I presume, as I still have to implement it completely. That said, can I get some suggestions, or is there a way to optimize this kernel? Specifically, how can I remove the for loop in the kernel function that calls the helper function?

OpenCL Kernel and its helper function:

void helper( int a_1, __global int* b_1, __global int* c_1 ){
        int i = get_global_id(0);
        c_1[i] = a_1 - b_1[i];
}

__kernel void test_call( __global int* a,             //input buffer of size [100000, 1]
                         __global int* b,             //input buffer of size [100000, 1]
                         __global int* c ){           //output buffer of size [100000, 1]
        for ( int iter = 0 ; iter < 100000 ; iter++ ){
                helper ( a[iter], b, c );
                // once array c is obtained by calling the above function,
                // it will be used in further processing that will take place inside
                // this for loop itself
        }
}

Essentially, what I am trying to do here is subtract each element in the input buffer 'a' from each element in the input buffer 'b'. The complexity is O(n²).

By the way, this naive implementation itself produces results within 2.5 seconds. A serial implementation takes a few minutes to complete.

    I suggest removing the call to helper and making it inline. Additionally, use a pragma to unroll the for loop so the GPU can exploit ILP more effectively. Commented Oct 23, 2012 at 21:10

1 Answer


I suspect your code can still be improved by using a tiling scheme. In your current implementation, all work-items load all values of buffer 'a', and at the moment they do so in an unsynchronized fashion. With a tiling scheme you can probably make better use of your cache architecture by having the device load each value of 'a' from off-chip memory only once.

This scheme is best explained here: http://software.intel.com/file/24571 (PDF file).

In your case, the tiled parallel approach could look like the following pseudo-code:

forall values of b in parallel {
    foreach tile q {
        forall work_items p in work_group in parallel {
            local[p] = a[q*tile_size + p]
        }
        synchronize work items in work group
        foreach element j in tile q {
            c[i] = local[j] - b[i];
        }
        synchronize work items in work group
    }
}

The key idea is that every work_item can make use of the values of buffer 'a' that have already been loaded by other work_items in the same group. Every entry of 'a' would then (ideally) only get fetched once from memory and tile_size-1 times from a cache.


2 Comments

Using the cache memory will improve the execution time (on a GPU, at least). You could also consider treating your input data as vectors to exploit SIMD instructions (not every OpenCL compiler will vectorize your kernel automatically).
@user1083498, I combined WarfarA's suggestion with tiling. I'm getting faster execution times than before. Thank you.
