I have been looking into OpenCL for a little while, to see if it will be useful in my context, and while I understand the basics, I'm not sure I understand how to force multiple instances of a kernel to run in parallel.
In my situation, the application I want to run is inherently sequential and, in some cases, takes a very large input (hundreds of MB). However, it has a number of options/flags that make it faster for some inputs and slower for others. My hope is that we can rewrite the application for OpenCL and then run each combination of options/flags in parallel, rather than guessing which set of flags to use.
My question is this: how many kernels can a graphics card run in parallel? Is this something that can be checked when purchasing? Is it linked to the number of shaders, the amount of memory, or the size of the application/kernel?
Additionally, while the input to the application will be the same, each execution will modify the data in a different way. Would I need to transfer the input data to each kernel separately to allow for this, or can each kernel allocate its own "local" memory?
Finally, would this even require multiple kernels, or could I use work-items instead? In that case, how do you determine how many work-items can run in parallel?
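To make the layout I have in mind concrete, here is a minimal sketch in plain C. It runs on the CPU and makes no OpenCL calls; the loop index `gid` only stands in for what I understand `get_global_id(0)` would return inside a kernel. All names (`run_variant`, `run_all_variants`, `flags`) are hypothetical, and the byte-wise addition is just a placeholder for the real algorithm. The input is shared and never written to, while each variant works on its own slice of one scratch buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the real sequential algorithm. The "flags"
 * value only changes how the private copy is modified; the shared
 * input is never written to. */
static void run_variant(const unsigned char *input, unsigned char *scratch,
                        size_t len, unsigned flags)
{
    memcpy(scratch, input, len);              /* private working copy */
    for (size_t i = 0; i < len; ++i)
        scratch[i] = (unsigned char)(scratch[i] + flags);
}

/* Runs every variant over the same read-only input. The scratch buffer
 * must hold num_variants * len bytes; variant `gid` owns the slice that
 * starts at offset gid * len. On a GPU the loop body would be one
 * work-item (an assumption about how this would map to OpenCL). */
static void run_all_variants(const unsigned char *input,
                             unsigned char *scratch,
                             size_t len, unsigned num_variants)
{
    assert(input != NULL && scratch != NULL);
    for (unsigned gid = 0; gid < num_variants; ++gid)
        run_variant(input, scratch + (size_t)gid * len, len, gid);
}
```

With this layout the input would only need to be transferred once, but the scratch buffer grows to the number of variants times the input size, which is part of what I am unsure about for inputs of hundreds of MB.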
(reference: http://www.drdobbs.com/parallel/a-gentle-introduction-to-opencl/231002854?pgno=3)