
The same OpenCL program is compiled for different OpenCL devices, possibly on different platforms. A command queue is created for each device, so there could be, for example, two queues: one for the CPU and one for the GPU.

Is it possible to call clEnqueueNDRangeKernel and then clEnqueueReadBuffer (blocking) on the two command queues, from different host threads (one for each command queue)?

For example, using OpenMP with a loop like:

// queues_ contains command queues for different contexts,
// each with one device on one platform (e.g. CPU and GPU)
#pragma omp parallel for num_threads(2) schedule(dynamic)
for(int i = 0; i < job_count; ++i) {
    cl::CommandQueue& queue = queues_[omp_get_thread_num()];
    // queue is for one device on one platform
    // enqueue kernel, and read buffer on queue
}

This would divide the job list into two chunks, one for the CPU and one for the GPU. schedule(dynamic) would make the scheduling adapt dynamically to the execution times of the kernels. The host code would spend most of its time waiting for the kernel (in the blocking clEnqueueReadBuffer call). But thanks to the CPU device, the CPU would actually be busy executing its kernel (in OpenCL) while at the same time waiting for the GPU to finish (in the host code).

Comments:
  • To your question, the answer is yes. Commented Jan 25, 2017 at 0:19
  • I created a real-time ray tracer that balanced the load between two GPUs and the CPUs. I did this by creating a thread for each device, each with its own context. I used pthreads for this (because I did not know OpenMP at the time), but I assume OpenMP would work just as well. I created separate contexts because I could not get separate queues working with Nvidia GPUs at the time (OpenCL 1.1), and on an Nvidia forum someone from Nvidia recommended creating a different context for each device, each with its own thread. Commented Jan 25, 2017 at 10:31

1 Answer


If the contexts are different too, the queues work independently, even alongside 3D applications. Depending on the implementation, the driver may preempt or interleave the two contexts, but you can additionally add event-based synchronization between them (mediated by the host, since OpenCL events cannot cross context boundaries) so that an item in queue-a waits for the completion of an item in queue-b.

If the queues live in the same context, you can synchronize them with OpenCL events directly, and the driver or API may apply implicit synchronization and its own performance optimizations.

Using all CPU cores for a memory-bound kernel does not leave enough bandwidth to copy arrays to and from the GPU quickly, unless the copies use direct memory access (DMA), which frees the CPU from issuing the copy instructions itself. If the cache is big and fast enough, this may not be necessary.
