
What methods exist to verify that work is indeed being parallelized by OpenCL? (How can I verify that work is being distributed to all the processing elements for execution?) Or at least a method to monitor which cores/processors of the GPU or CPU are being used?

I would simply like a way to verify that OpenCL is actually doing what its specification claims. To do this, I need to collect hard evidence that OpenCL / the OS / the drivers are indeed scheduling kernels and work items for parallel execution (as opposed to serial execution).

I have written an OpenCL program conforming to the OpenCL API 1.2 specification, along with a simple OpenCL C kernel which simply squares the input integer.
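
For reference, a squaring kernel of that shape might look like the following (a minimal sketch; the kernel and argument names are illustrative, not the actual source):

```c
// OpenCL C device code: squares each element of the input buffer.
__kernel void square(__global const int *in, __global int *out)
{
    size_t gid = get_global_id(0);   // unique index of this work item
    out[gid] = in[gid] * in[gid];
}
```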

In my program, work_group_size = MAX_WORK_GROUP_SIZE (so that they will fit on the compute units and so that OpenCL won't throw a fit).

The total amount_of_work is a scalar multiple of (MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE). Since amount_of_work > MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE, hopefully this is enough to force the schedulers to execute kernels and work items as efficiently as possible, making use of all the available cores / processors.

  • For a CPU, you can check cpuid, or sched_getcpu, or GetCurrentProcessorNumber in order to check which core / processor the current thread is executing on.
  • Is there a method on the OpenCL API which provides this information? (I have yet to find any.)
  • Is there an OpenCL C language built-in function... or perhaps do the vendors' compilers understand some form of assembly language which I could use to obtain this information?
  • Is there an equivalent to cpuid, sched_getcpu, or GetCurrentProcessorNumber for GPUs, for core usage monitoring, etc.? Perhaps something vendor or architecture specific?
  • Is there an external program which I could use as a monitor for this information? I have tried Process Monitor and AMD's CodeXL, neither of which is useful for what I'm looking for. Intel has VTune, but I doubt that works on an AMD GPU.
  • Perhaps I could take a look at the compiled kernel code as generated from the AMD and Intel Compilers for some hints?

Hardware Details:

  • GPU: AMD FirePro, AMD Cape Verde architecture, 7700M Series chipset. I don't know exactly which one in the series it is. If there is an AMD instruction set manual for this architecture (as there are for x86), that would possibly be a start.
  • CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz

Development Environment Details:

  • OS: Win 7 64-bit; will also eventually need to run on Linux, but that's beside the point.
  • Compiling with MinGW GNU GCC 4.8.1 -std=c++11
  • Intel OpenCL SDK (OpenCL header, libraries, and runtime)
  • According to Process Manager, Intel's OpenCL compiler is a clang variant.
  • AMD APP OpenCL SDK (OpenCL header, libraries, and runtime)
  • OpenCL 1.2
  • I am trying to keep the source code as portable as possible.
  • A simple test is via workload: compare one thread vs two threads. Commented Sep 18, 2013 at 22:44
  • @kchoi Can you expand a bit on that? Perhaps in an OpenCL context, that would mean work_group_size = 1 vs work_group_size = MAX? Commented Sep 18, 2013 at 22:46
  • This abstraction of how the code is executed is exactly what the OpenCL spec tries to provide. Why would you like to know how it is being executed internally? Even if you knew, you CAN'T change it. And even if the kernel is not using 100% of the available resources, you never know what is being used by the rest of the cores (other applications, screen refresh, etc...). You have to assume the driver will do its best to fit the work onto the device. Commented Sep 19, 2013 at 8:30
  • @DarkZeros Why: I want to verify the parallelism (if there is indeed any), and if there is, to what degree. It would also be good to know if the kernel or drivers are limiting how much of a device OpenCL can access, and to what degree they are "interfering", so to speak. Commented Sep 19, 2013 at 15:47
  • I'm afraid you will not find any tools for this inside the OpenCL spec, since its aim is just the opposite: to provide a generic environment that abstracts the underlying HW. Maybe you can find something from the driver manufacturer, but I really doubt they will provide anything more detailed than "GPU usage %". Commented Sep 19, 2013 at 16:01
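
One concrete way to act on kchoi's suggestion is to time the same NDRange with work_group_size = 1 against the device maximum, using OpenCL event profiling (a fragment, assuming a queue created with CL_QUEUE_PROFILING_ENABLE and an already-built kernel; error checking omitted):

```c
/* Enqueue the same amount of work twice: once serial-ish, once wide. */
size_t global = amount_of_work;
size_t local_serial = 1;              /* one work item per work group  */
size_t local_wide = max_work_group_size;
cl_event ev;
cl_ulong start, end;

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global,
                       &local_serial, 0, NULL, &ev);
clWaitForEvents(1, &ev);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                        sizeof start, &start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                        sizeof end, &end, NULL);
/* Repeat with &local_wide and compare the two (end - start) times:
   a large ratio is indirect evidence the work was spread out.        */
```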

1 Answer


Instead of relying on speculation, you can comment out the program's buffer copies and visualisations, leaving only the kernel executions intact. Then put it in a tight loop and watch the temperature rise. If it heats up like FurMark, then it is using the cores. If it is not heating up, you can disable serial operations in the kernels too (gid == 0), then try again. For example, a simple n-body simulator pushes a well-cooled HD 7000 series GPU to over 70°C in minutes, and past 90°C with a poor cooler. Compare against a known benchmark's temperature limits.
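
The "tight loop" described here can be as simple as re-enqueueing the kernel with no intervening reads (a fragment; queue and kernel setup assumed, names illustrative):

```c
/* Keep only kernel executions: no clEnqueueReadBuffer, no display.   */
for (int i = 0; i < 100000; ++i)
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                           0, NULL, NULL);
clFinish(queue);  /* block until the device drains the queue */
```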

A similar approach exists for the CPU. Using float4 heats the chip more than plain float, which shows that even the instruction type matters for keeping all the ALUs busy (let alone all the threads).
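
The float vs float4 point can be tested with two variants of the same kernel (an illustrative sketch; the vectorized version gives the compiler an explicit shot at the SIMD ALUs):

```c
__kernel void square_scalar(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * in[i];
}

__kernel void square_vec4(__global const float4 *in, __global float4 *out)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * in[i];     /* component-wise multiply */
}
```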

If the GPU has a really good cooler, you can watch its Vdroop instead. More load means more voltage drop: more active cores means more drop, and more load per core also means more drop.

Whatever you do, it's up to the compiler and the hardware's abilities, and you don't have explicit control over the ALUs, because OpenCL hides the hardware complexity from the developer.

Using MSI Afterburner or similar software is not useful, because it shows 100% usage even when you are using 1% of the card's true potential.

Simply look at the temperature difference of the computer case between its starting state and its equilibrium state. If delta-T is around 50°C with OpenCL and 5°C without, then OpenCL is parallelising work, though you can't know how much.
