
I am trying to run a vector-addition application in which I need to launch multiple kernels concurrently. For concurrent kernel launch, someone on my last question advised me to use multiple command queues, which I am defining as an array:

context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &err);
for (i = 0; i < num_ker; ++i)
{
    queue[i] = clCreateCommandQueue(context, device_id, 0, &err);
}

I am getting the error "command terminated by signal 11" somewhere around the above code.

I am also using a for loop to enqueue data and launch the kernels:

for (i = 0; i < num_ker; ++i)
{
    err = clEnqueueNDRangeKernel(queue[i], kernel, 1, NULL, &globalSize, &localSize,
                                 0, NULL, NULL);
}

The thing is, I am not sure where I am going wrong. I saw somewhere that we can make an array of command queues, which is why I am using one. One more piece of information: when I do not use a for loop and instead define the multiple command queues manually, it works fine.

1 Answer
I read your last question as well, and I think you should first rethink what you really want to do and whether OpenCL is really the right way to do it.

OpenCL is an API for massively parallel processing and data crunching, where each kernel (or queued task) operates on many data values in parallel at the same time, thereby outperforming any serial CPU processing by orders of magnitude.

The typical use case for OpenCL is one kernel running millions of work items. More advanced applications may need multiple sequences of different kernels, and special synchronization between CPU and GPU.

But concurrency is never a requirement. (Otherwise, a single-core CPU would not be able to perform the task, and that is never the case. It will be slower, yes, but it will still be possible to run it.)

Even if two tasks need to run at the same time, the total time taken will be the same whether they run concurrently or not:

Not concurrent case:

Kernel 1: *
Kernel 2: -
GPU Core 1: *****-----
GPU Core 2: *****-----
GPU Core 3: *****-----
GPU Core 4: *****-----

Concurrent case:

Kernel 1: *
Kernel 2: -
GPU Core 1: **********
GPU Core 2: **********
GPU Core 3: ----------
GPU Core 4: ----------

In fact, the non-concurrent case is preferable, since the first task is already completed at the halfway point and further processing that depends on it can continue.


What you want to do, as far as I understand, is run multiple kernels at the same time, so that the kernels run fully concurrently. For example, run 100 kernels (the same kernel or different ones) all at the same time.

That does not fit the OpenCL model at all, and in fact it may be far slower than a single CPU thread.

If each kernel is independent of all the others, a core (SIMD or CPU) can only be allocated to one kernel at a time (because each core has only one program counter), even though it could run 1k work items at the same time. In the ideal scenario, this turns your OpenCL device into a pool of a few cores (6-10) that serially consume the queued kernels. And that assumes the API and the device support it, which is not always the case. In the worst case, you will have a single device that runs a single kernel at a time and is 99% wasted.

Examples of stuff that can be done in OpenCL:

  • Data crunching/processing: multiplying vectors, simulating particles, etc.
  • Image processing: edge detection, filtering, etc.
  • Video compression, editing, generation
  • Ray tracing, complex lighting math, etc.
  • Sorting

Examples of stuff that is not suitable for OpenCL:

  • Attending to async requests (HTTP, traffic, interactive data)
  • Processing low amounts of data
  • Processing data that needs completely different processing for each item

From my point of view, the only real use case for multiple kernels is the last one, and no matter what you do, the performance will be horrible in that case. Better to use a multithreaded CPU pool instead.


1 Comment

Thank you so much for taking the time to reply; I really appreciate it. Your question of why I would do that is valid. The answer is that right now I am working on research that involves different programming models (CUDA, OpenMP, OpenCL, etc.) to see how they switch between different tasks, how they scale, and how they react when running many tasks concurrently. Sorry for not being very clear about what I was trying to do, but your answer explains a lot about the behavior. Thanks :)
