Running parallel OpenCL kernels

Question

I have been looking into OpenCL for a little while, to see if it will be useful in my context, and while I understand the basics, I'm not sure I understand how to force multiple instances of a kernel to run in parallel.

In my situation, the application I want to run is inherently sequential and takes (in some cases) a very large input (hundreds of MB). However, the application in question has a number of different options/flags that can be set which in some cases make it faster, or slower. My hope is that we can re-write the application for OpenCL and then execute each option/flag in parallel, rather than guessing which sets of flags to use.

My question is this: how many kernels can a graphics card run in parallel? Is this something that can be looked at when purchasing? Is it linked to the number of shaders, the memory, or the size of the application/kernel?

Additionally, while the input to the application will be the same, each execution will modify the data in a different way. Would I need to transfer the input data to each kernel separately to allow for this, or can each kernel allocate "local" memory?

Finally, would this even require multiple kernels, or could I use work-items instead? In that case, how do you determine how many work-items can run in parallel?

(reference: http://www.drdobbs.com/parallel/a-gentle-introduction-to-opencl/231002854?pgno=3)

mfa · Accepted Answer · 2014-12-02 03:30:31Z

3

Your question seems to pop up from time to time in various forums and on SO. The feature you would use to run kernels separately at the hardware level is called device fission. To read more about the extension, search for "cl_ext_device_fission".

This extension has been enabled on CPUs for a long time, but not on GPUs. The very newest graphics hardware might support device fission. You probably need a GPU from Q2 2014 or newer, but you will have to research this yourself.

The way to get kernels to run in parallel using OpenCL software only is to queue them on different command queues on the same device. Some developers say that multiple queues harm performance, but I don't have personal experience with it.


2 Comments

Zack Newsham Over a year ago

Thanks for the answer. It turns out I don't need to run multiple kernels at all: I can simply pass in some global data that all instances of the kernel access, then pass in the custom data in an array. However, I'm mostly interested in determining how many instances of a kernel can be run in parallel - as far as I understand, this has to do with the amount of resources used by the kernel. Is there a way of checking this?

Gowtham Over a year ago

I think it's safe to use a number of queues <= the number of compute units.

ollj · Accepted Answer · 2016-03-22 00:13:53Z

For compatibility with hardware from roughly 2008 to 2015, it is safest to assume that every GPU can run only one kernel at any moment, and that kernels are compiled at runtime, swapped in and out, and queued to emulate multiple kernels. This swapping is why large kernels are better than tiny kernels.

Single-kernel client compute units are the default. The option to run two different, independent kernels in parallel at the same time is the exception. Assume it is rare, unsupported, or slower.

Of course, two CPUs in one computer can do that. But as of 2016, having two CPUs in one system is still rather uncommon, and having four is even rarer.

Some graphics cards may be able to run two kernels in parallel. Assume that they do not.

user2746401 · Accepted Answer · 2014-12-02 13:34:23Z

How many kernels can a graphics card run in parallel?

You can look up how many kernel instances (i.e. the same kernel code with different launch ids) can be run in parallel on a graphics card. This is a function of the number of SIMDs/CUs/shaders/etc., depending on what the GPU vendor likes to call them. Getting an exact number of how many kernel instances really execute is a little complicated, because it depends on the occupancy, which in turn depends on the resources the kernel uses, e.g. registers used and local memory used.

If you mean how many kernel dispatches (i.e. different kernel code and cl_kernel objects or different kernel arguments) can be run in parallel, then all the GPUs I know of can only run a single kernel at a time. These kernels may be picked up from multiple command queues but the GPU will only process one at a time. This is why cl_ext_device_fission is not supported on current GPUs - there is no way to "split" the hardware. You can do it yourself in your kernel code, though (see below).

Can each kernel allocate "local" memory?

Yup. This is exactly what OpenCL local memory is for. However, it is a limited resource, so it should be thought of as a kernel-controlled cache rather than a heap.

In which case, how do you determine how many work-items can run in parallel?

Same answer as the first question assuming kernel instances.
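As a rough illustration of how resource usage limits the number of work-items that can be resident at once, here is a minimal sketch in plain C. It is a simplified model and not any vendor's actual occupancy calculation: the struct names, field names, and the idea that only registers, local memory, and a per-CU work-item cap matter are all assumptions made for the example. Real limits have to come from the vendor's documentation or from device and kernel queries.

```c
#include <stddef.h>

/* Hypothetical per-compute-unit limits (assumed for illustration only). */
typedef struct {
    size_t registers_per_cu;      /* registers available on one compute unit */
    size_t local_mem_per_cu;      /* bytes of local memory on one compute unit */
    size_t max_work_items_per_cu; /* cap on resident work-items per compute unit */
    size_t compute_units;         /* number of compute units on the device */
} DeviceLimits;

/* Hypothetical resource usage of one kernel (assumed for illustration only). */
typedef struct {
    size_t registers_per_item;  /* registers used by each work-item */
    size_t local_mem_per_group; /* bytes of local memory used by each work-group */
    size_t work_group_size;     /* work-items per work-group */
} KernelUsage;

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Work-groups that fit on one compute unit: the tightest of the three limits. */
size_t resident_groups_per_cu(DeviceLimits d, KernelUsage k)
{
    if (k.work_group_size == 0)
        return 0;

    /* Limit from the per-CU cap on resident work-items. */
    size_t groups = d.max_work_items_per_cu / k.work_group_size;

    /* Limit from register usage, if the kernel uses any registers. */
    if (k.registers_per_item > 0)
        groups = min_size(groups,
            d.registers_per_cu / (k.registers_per_item * k.work_group_size));

    /* Limit from local memory usage, if the kernel uses any. */
    if (k.local_mem_per_group > 0)
        groups = min_size(groups, d.local_mem_per_cu / k.local_mem_per_group);

    return groups;
}

/* Work-items resident across the whole device under this simplified model. */
size_t resident_work_items(DeviceLimits d, KernelUsage k)
{
    return resident_groups_per_cu(d, k) * k.work_group_size * d.compute_units;
}
```

With made-up limits of 65536 registers, 64 KiB of local memory, and 2560 work-items per compute unit on an 8-CU device, a kernel that uses 32 registers per work-item and 16 KiB of local memory per 256-item work-group is limited by local memory to 4 work-groups per compute unit. Note that "resident" is not the same as "executing in the same cycle"; the hardware interleaves resident work-items to hide latency.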

Would this even require multiple kernels, could I use work-items instead?

You can simulate different kernels running by using an uber-kernel that decides which sub-kernel to run based on work item global id. For example:

void subKernel0( .... )
{
    int gid = get_global_id(0);
    // etc.
}

void subKernel1( .... )
{
    int gid = get_global_id(0) - DISPATCH_SIZE_0;
    // etc.
}

__kernel void uberKernel( .... )
{
    if( get_global_id(0) < DISPATCH_SIZE_0 )
    {
        subKernel0( .... );
    }
    else if( get_global_id(0) < DISPATCH_SIZE_0 + DISPATCH_SIZE_1 )
    {
        subKernel1( .... );
    }
    else if( .... )
    {
        // etc.
    }
}

The usual performance suggestions, such as making each dispatch size a multiple of 32 or 64, also apply here. You'll have to adjust the other ids (for example the group id) in the same way as the global id.
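The id bookkeeping for the uber-kernel approach can be sketched on the host side in plain C. This is only an illustration under assumptions: the function names are hypothetical, and the padding multiple (64 in the example) stands in for whatever wavefront/warp or work-group size suits the device. Each sub-dispatch is padded up to that multiple so that every sub-kernel starts on an aligned boundary, and ids that fall in the padding are reported as doing no real work.

```c
#include <stddef.h>

/* Round n up to the next multiple of `multiple` (multiple must be > 0). */
size_t round_up(size_t n, size_t multiple)
{
    return ((n + multiple - 1) / multiple) * multiple;
}

/* Global size of the combined launch when every sub-dispatch is padded. */
size_t total_global_size(const size_t *sizes, size_t count, size_t multiple)
{
    size_t total = 0;
    for (size_t i = 0; i < count; ++i)
        total += round_up(sizes[i], multiple);
    return total;
}

/* Map a global id to a sub-kernel index and an id within that sub-kernel.
   Returns 1 if the id corresponds to real work, and 0 if it falls in the
   padding of a sub-dispatch or beyond the end of the launch. When the id is
   beyond the end of the launch, the outputs are left untouched. */
int map_global_id(size_t gid, const size_t *sizes, size_t count,
                  size_t multiple, size_t *sub_kernel, size_t *sub_id)
{
    size_t base = 0; /* first global id of the current sub-dispatch */
    for (size_t i = 0; i < count; ++i) {
        size_t padded = round_up(sizes[i], multiple);
        if (gid < base + padded) {
            *sub_kernel = i;
            *sub_id = gid - base;
            return *sub_id < sizes[i];
        }
        base += padded;
    }
    return 0;
}
```

For sub-dispatch sizes of 100, 30, and 64 with a multiple of 64, the padded sizes are 128, 64, and 64, so the combined launch has a global size of 256 and the second sub-kernel starts at global id 128. Inside the kernel, the equivalent of the "real work" check is an early return for ids in the padding.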

According to AMD, GCN 1.0 cards can execute kernels in parallel. From the AMD OpenCL Programming Guide, section 5.7.4: "AMD Southern Islands GPUs can execute multiple kernels simultaneously when there are no dependencies."
