The same OpenCL program is compiled for several OpenCL devices, possibly on different platforms, and a command queue is created for each device. For example, there could be two queues, one for the CPU and one for the GPU.
Is it possible to call clEnqueueNDRangeKernel and then clEnqueueReadBuffer (blocking) on the two command queues, from different host threads (one for each command queue)?
For example, using OpenMP, with a loop like:
// queues_ contains command queues for different contexts,
// each with one device on one platform (e.g. CPU and GPU)
#pragma omp parallel for num_threads(2) schedule(dynamic)
for (int i = 0; i < job_count; ++i) {
    // each OpenMP thread always uses its own queue:
    // one device on one platform
    cl::CommandQueue& queue = queues_[omp_get_thread_num()];
    // enqueue kernel for job i, then do a blocking read of the result buffer on queue
}
This would split the job list between the CPU and the GPU. With schedule(dynamic), each iteration is handed to whichever thread becomes free next, so the split adapts to the execution times of the kernels on each device.
The host threads would spend most of their time waiting for the kernels (in the blocking clEnqueueReadBuffer call). But thanks to the CPU device, the CPU would actually be busy executing the kernel (inside OpenCL) while the host code waits for the GPU to finish.
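To make the intended scheduling concrete, here is a minimal sketch of the same idea without OpenCL or OpenMP: one host thread per device, each pulling the next job index from a shared counter, which is roughly what schedule(dynamic) does. Everything in it is a stand-in I made up for illustration: run_job_on_device just sleeps instead of enqueuing a kernel and doing a blocking read, and run_jobs_dynamic and device_cost_ms are hypothetical names, not part of any OpenCL API.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for "enqueue kernel, then blocking read" on one device.
// cost_ms models how long that device needs per job (assumption, not measured).
inline void run_job_on_device(int cost_ms) {
    std::this_thread::sleep_for(std::chrono::milliseconds(cost_ms));
}

// Runs job_count jobs on one worker thread per device.
// Returns, for each job, the index of the worker (device) that processed it.
inline std::vector<int> run_jobs_dynamic(int job_count,
                                         const std::vector<int>& device_cost_ms) {
    assert(job_count >= 0);
    assert(!device_cost_ms.empty());

    std::vector<int> owner(static_cast<std::size_t>(job_count), -1);
    std::atomic<int> next_job{0};  // shared counter: the "dynamic schedule"
    std::vector<std::thread> workers;

    for (std::size_t w = 0; w < device_cost_ms.size(); ++w) {
        workers.emplace_back([&, w] {
            for (;;) {
                // Claim the next unprocessed job; each index is handed out once.
                const int i = next_job.fetch_add(1);
                if (i >= job_count) break;
                run_job_on_device(device_cost_ms[w]);  // blocks like a blocking read
                owner[static_cast<std::size_t>(i)] = static_cast<int>(w);
            }
        });
    }
    for (auto& t : workers) t.join();
    return owner;
}
```

Calling, say, run_jobs_dynamic(12, {1, 20}) models a fast and a slow device. The faster worker should usually end up with more of the jobs, but the exact split depends on timing. In the real code each worker would use only its own command queue, kernel, and buffers.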