
The same OpenCL program is compiled for different OpenCL devices, possibly on different platforms. A command queue is created for each device, so there could be, for example, two queues: one for the CPU and one for the GPU.

Is it possible to call clEnqueueNDRangeKernel and then clEnqueueReadBuffer (blocking) on the two command queues, from different host threads (one for each command queue)?

For example, using OpenMP with a loop like:

// queues_ contains command queues for different contexts,
// each with one device on one platform (e.g. CPU and GPU)
#pragma omp parallel for num_threads(2) schedule(dynamic)
for(int i = 0; i < job_count; ++i) {
    cl::CommandQueue& queue = queues_[omp_get_thread_num()];
    // queue is for one device on one platform
    // enqueue kernel, and read buffer on queue
}

This would divide the job list into two chunks, one for the CPU and one for the GPU. schedule(dynamic) would make the scheduling adapt dynamically to the execution times of the kernels. The host code would spend most of its time waiting for the kernel (in the blocking clEnqueueReadBuffer call). But thanks to the CPU device, the CPU would actually be busy executing its kernel (in OpenCL) while at the same time waiting for the GPU to finish (in the host code).

Comments:
  • To your question, the answer is yes. Commented Jan 25, 2017 at 0:19
  • I created a real-time ray tracer that balanced the load between two GPUs and the CPUs. I did this by creating a thread for each device, each with its own context. I used pthreads for this (because I did not know OpenMP at the time), but I assume OpenMP would work just as well. I created separate contexts because I could not get separate queues working with Nvidia GPUs at the time (OpenCL 1.1), and on an Nvidia forum someone from Nvidia recommended creating a different context for each device, each with its own thread. Commented Jan 25, 2017 at 10:31

1 Answer


If the contexts are different too, the queues work independently, even alongside 3D applications. Depending on the implementation, the driver may preempt or interleave the two contexts, but you can additionally add event-based synchronization between them (mediated by the host, since OpenCL events cannot cross context boundaries) so that an item in queue-a waits for the completion of an item in queue-b.

If the queues live in the same context, you can synchronize them with OpenCL events directly, and the driver or API may apply implicit synchronization and its own performance optimizations.

Using all CPU cores for a memory-bound kernel does not leave enough bandwidth to copy arrays to and from the GPU quickly, unless the copies use direct memory access (DMA), which frees the CPU from issuing the copy instructions itself. If the cache is big and fast enough, this may not be necessary.
