AMD OpenCL asynchronous execution efficiency

Question

For example, I have three tasks: A, B, and C. B and C both depend on A, and there are sufficient CUs to run B and C at the same time. I enqueue A and C on queue0, and B on queue1. There is a huge delay after A finishes and before B starts, which makes the whole job take longer than it does with only one queue.

Is this normal? Or could I have done something wrong?

I will write sample code if required; the original code is heavily encapsulated. All I actually do is create an event when enqueuing A and pass it to the enqueue of B, and both queues are ordinary in-order queues. Nothing seems special.
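To make the setup concrete, here is a minimal timing sketch. It is not the original code and it does not call OpenCL. It assumes both queues are in-order, that C starts as soon as A finishes on queue0, and that B starts on queue1 only after a cross-queue event delay. The function names, the `event_delay_us` parameter, and the numbers are illustrative assumptions, and per-command launch overhead on a single queue is ignored.

```python
def one_queue_makespan(t_a, t_b, t_c):
    """A, B, C run back to back on a single in-order queue."""
    return t_a + t_b + t_c


def two_queue_makespan(t_a, t_b, t_c, event_delay_us):
    """A and C on queue0, B on queue1 waiting on A's event.

    C follows A directly on queue0; B starts event_delay_us after A ends.
    B and C are assumed to overlap fully once both have started.
    """
    return t_a + max(t_c, event_delay_us + t_b)


if __name__ == "__main__":
    # Illustrative numbers in microseconds, not measurements.
    for delay in (1, 50, 200):
        single = one_queue_makespan(50, 50, 50)
        dual = two_queue_makespan(50, 50, 50, delay)
        print(f"delay={delay:>3} us  one queue={single} us  two queues={dual} us")
```

Under these assumptions, the two-queue version is slower than the single queue exactly when the cross-queue delay exceeds the runtime of C, so short kernels are the ones most exposed to this effect.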

I'm using similar parallel queues and seeing latencies between event-driven steps with an HD7870 and an R7-240. I then changed the queues so that A+B+C go on the same queue, and the queue is duplicated: for 10x(A+B+C), 10 queues are spawned. That worked fast without any stutter. The drivers handle the best order of operations, as far as I can see from the CodeXL timeline graph.
– huseyin tugrul buyukisik, Commented Feb 18, 2017 at 19:21
@huseyintugrulbuyukisik I guess maybe this is just normal... Sometimes I think AMD is joking with us...
– BlueWanderer, Commented Feb 19, 2017 at 8:13

huseyin tugrul buyukisik · Accepted Answer · 2017-02-19 11:23:24Z


I couldn't find any documentation about these latencies, and to call something normal we would need a statistically derived latency baseline for all platforms. Here are my own measurements:

The HD7870 and R7-240 show the same behaviour. Setup: Windows 10, dual-channel RAM, OpenCL 1.2 (64-bit build), CodeXL profiling, all in-order queues, and some old pre-Crimson drivers.

Eventless single queue with non-blocking commands: fluctuates from several microseconds to 200 microseconds, with an average that should be low, around 50 microseconds, depending on drivers. For some kernels it goes up to 500 microseconds, maybe because of too many parameters and similar preparation work.
Event source = single queue A, event target = queue B: 100-150 microseconds up to half a millisecond (seemed constant).
Event source = list of N-1 queues, event target = queue N: not the sum of all the queues' latencies; some kind of latency hiding happens, so it is no more than 2 milliseconds (rarely peaking at 3-5 milliseconds).
Event source = queue, host waiting with clWaitForEvents: about a millisecond.
Event source = queue, host polling clGetEventInfo in a while loop: nearly half a millisecond, sometimes even less.
clFinish on a single queue: this has the highest latency per queue, at least 1 ms.
User events: these generated errors in CodeXL, so I couldn't measure their performance, but that was with an older driver and an older CodeXL version.

There were background processes running (Avira, Google Chrome, ...) that are advanced enough to use the GPU for their own purposes and may have hindered kernel execution.

My solution was pipelining: using many independent queues to hide the event latencies, which worked like a charm. The R7-240 ran fine with 16 queues. It has only 2 ACE units, so newer cards with 4-8 of them could work with more queues.
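The effect of adding queues can be estimated with a rough model. The sketch below is an idealized lower bound, not a measurement and not driver behaviour: it assumes every chain (such as A+B+C) has `steps` kernels of equal length, that each step on a queue is followed by an idle gap of `gap_us`, and, pessimistically, that the device runs only one kernel at a time. All names and numbers are hypothetical.

```python
import math


def pipelined_makespan_estimate(n_chains, steps, kernel_us, gap_us, n_queues):
    """Idealized lower bound on total time, in microseconds.

    compute_bound: the device must execute every kernel once.
    queue_bound:   the busiest queue runs its chains one after another,
                   paying the idle gap after every step.
    """
    if n_queues < 1:
        raise ValueError("n_queues must be at least 1")
    compute_bound = n_chains * steps * kernel_us
    chains_per_queue = math.ceil(n_chains / n_queues)
    queue_bound = chains_per_queue * steps * (kernel_us + gap_us)
    return max(compute_bound, queue_bound)


if __name__ == "__main__":
    # 10 chains of 3 kernels, 50 us per kernel, 150 us gap (illustrative).
    for queues in (1, 4, 10, 16):
        total = pipelined_makespan_estimate(10, 3, 50, 150, queues)
        print(f"{queues:>2} queues: at least {total} us")
```

In this model the gaps stop mattering once there are enough queues for the compute bound to dominate; beyond that point extra queues do not help.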

What I didn't try, and still wonder about, is the performance of N queues waiting on the completion of M other queues through an event list. Maybe a tree-like waiting structure would be better for many queues if they lag too much.
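As a sketch of what such a tree-like structure could look like, the helper below only plans the grouping; it does not call OpenCL and says nothing about whether the tree is actually faster, which remains untested. The function name, the `fan_in` parameter, and the `joinN` labels are hypothetical. In OpenCL 1.2 each join point could plausibly be a clEnqueueMarkerWithWaitList call on some queue, but that mapping is an assumption.

```python
def build_wait_tree(event_ids, fan_in):
    """Return levels of wait groups that reduce many events to one.

    Each group stands for one hypothetical join point that waits on at
    most `fan_in` events and produces a single new event for the next level.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    levels = []
    current = list(event_ids)
    next_id = 0
    while len(current) > 1:
        # Split the current events into consecutive groups of size fan_in.
        groups = [current[i:i + fan_in] for i in range(0, len(current), fan_in)]
        levels.append(groups)
        # Every group yields one new (hypothetical) join event.
        current = [f"join{next_id + i}" for i in range(len(groups))]
        next_id += len(groups)
    return levels


if __name__ == "__main__":
    events = [f"e{i}" for i in range(8)]
    for depth, groups in enumerate(build_wait_tree(events, 2)):
        print(f"level {depth}: {groups}")
```

With 8 source events and a fan-in of 2, the final waiter depends on 2 events instead of 8, at the cost of three levels of joins, each of which may add its own event latency.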

edited Feb 19, 2017 at 11:23

answered Feb 19, 2017 at 10:06


3 Comments

BlueWanderer Over a year ago

On my Fiji device, the delay is only 1 microsecond for synchronous execution, which makes the delay in async execution look like a nightmare. I wonder if there is a right way to do async execution without such a delay. "Shouldn't events just be dependency markers on the GPU side?" I thought. And AMD said that this is what their GPUs are good at, who knows...

huseyin tugrul buyukisik Over a year ago

1 us is good compared to my cards. There is more than one ACE unit, so duplicating all the queues should hide it.

BlueWanderer Over a year ago

I guess the CUs have to be left "idle" for that 1 us (about 1.5 us, more precisely) between tasks; using multiple queues doesn't seem to hide it. It doesn't matter anyway, since each kernel takes about 50 us to run, more if I widen the task to fill the CUs.
