What methods exist to verify that work is indeed being parallelized by OpenCL? (How can I verify that work is being distributed to all the processing elements for execution?) Or at least a method to monitor which cores/processors of the GPU or CPU are being used?
I would simply like a way to verify that OpenCL is actually doing what its specification claims. To do this, I need to collect hard evidence that OpenCL / the OS / the drivers are indeed scheduling kernels and work items to be executed in parallel (as opposed to serially).
I have written an OpenCL program conforming to the OpenCL API 1.2 specification, along with a simple OpenCL C kernel which squares the input integer.
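For concreteness, a kernel of the kind described might look like the sketch below. My actual kernel is not reproduced here, so the kernel name `square`, its argument names, and the host helper `square_reference` are all assumptions made for illustration. The host helper performs the same computation serially, one loop iteration per would-be work item, which gives a baseline for checking device output.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical OpenCL C kernel source (names are assumptions for illustration).
// Each work item squares exactly one element, selected by its global ID.
const char* const kSquareKernelSource = R"CLC(
__kernel void square(__global const int* in, __global int* out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid] * in[gid];
}
)CLC";

// Host-side serial reference: one loop iteration per would-be work item.
// Useful for comparing against whatever the device writes back.
std::vector<int> square_reference(const std::vector<int>& in)
{
    std::vector<int> out(in.size());
    for (std::size_t gid = 0; gid < in.size(); ++gid) {
        out[gid] = in[gid] * in[gid];
    }
    return out;
}
```

Note that matching results only show the kernel is correct; they say nothing about whether the work items ran in parallel, which is the actual question.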
In my program, work_group_size = MAX_WORK_GROUP_SIZE (so that they will fit on the compute units and so that OpenCL won't throw a fit).
The total amount_of_work is a scalar multiple of (MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE). Since amount_of_work > MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE, hopefully OpenCL
Hopefully this would be enough to force the schedulers to execute the maximum number of kernels + work items as efficiently as possible, making use of the available cores / processors.
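The sizing described above can be sketched as plain arithmetic. This is a minimal illustration, not my actual host code: `plan_work`, `WorkSizes`, and `multiplier` are hypothetical names, and the two device limits are assumed to have been queried beforehand with `clGetDeviceInfo` (`CL_DEVICE_MAX_COMPUTE_UNITS` and `CL_DEVICE_MAX_WORK_GROUP_SIZE`).

```cpp
#include <cstddef>

// Sizes for a one-dimensional NDRange, derived as described in the text.
struct WorkSizes {
    std::size_t local_size;   // work items per work group
    std::size_t global_size;  // total number of work items
    std::size_t num_groups;   // number of work groups to be scheduled
};

// max_compute_units and max_work_group_size stand in for the values
// reported by the device; multiplier is the (hypothetical) scalar factor.
WorkSizes plan_work(std::size_t max_compute_units,
                    std::size_t max_work_group_size,
                    std::size_t multiplier)
{
    WorkSizes s;
    s.local_size  = max_work_group_size;
    s.global_size = multiplier * max_compute_units * max_work_group_size;
    // global_size is an exact multiple of local_size by construction.
    s.num_groups  = s.global_size / s.local_size;
    return s;
}
```

With made-up limits of 10 compute units and a maximum work group size of 256, a multiplier of 4 gives 10240 work items in 40 work groups, i.e. four work groups per compute unit if the scheduler spreads them evenly.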
- For a CPU, you can check cpuid, or sched_getcpu, or GetProcessorNumber in order to check which core / processor the current thread is currently executing on.
- Is there a method on the OpenCL API which provides this information? (I have yet to find any.)
- Is there an OpenCL C language built in function... or perhaps do the vendor's compilers understand some form of assembly language which I could use to obtain this information?
- Is there an equivalent to cpuid, sched_getcpu, or GetProcessorNumber for GPUs for core usage monitoring, etc? Perhaps something vendor architecture specific?
- Is there an external program which I could use as a monitor for this information? I have tried Process Monitor and AMD's CodeXL, both of which are not useful for what I'm looking for. Intel has VTune, but I doubt that works on an AMD GPU.
- Perhaps I could take a look at the compiled kernel code as generated from the AMD and Intel Compilers for some hints?
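To illustrate the kind of information I am after, here is a minimal host-side sketch of the CPU case using `sched_getcpu`. It is Linux/glibc only (it will not build with MinGW on Windows), the helper name `sample_cpus` is hypothetical, and it only reports where an ordinary host thread runs. It cannot be called from inside an OpenCL kernel, which is exactly the gap I am asking about.

```cpp
#include <sched.h>  // sched_getcpu: glibc extension, Linux only
                    // (g++ defines _GNU_SOURCE by default, which exposes it)
#include <set>

// Samples the CPU the calling thread is running on `samples` times and
// returns the distinct CPU IDs observed. The set has more than one entry
// only if the OS migrated the thread between cores while sampling.
std::set<int> sample_cpus(int samples)
{
    std::set<int> seen;
    for (int i = 0; i < samples; ++i) {
        int cpu = sched_getcpu();  // returns -1 on error
        if (cpu >= 0) {
            seen.insert(cpu);
        }
    }
    return seen;
}
```

Calling `sample_cpus` from several host threads and merging the results would show which cores the OS used for them; I am looking for the equivalent per work item or per work group.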
Hardware Details:
- GPU: AMD FirePro, using the AMD Cape Verde architecture, 7700M Series chipset. I don't know exactly which one in the series it is. If there is an AMD instruction set manual for this architecture (as there are manuals for x86), that would possibly be a start.
- CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Development Environment Details:
- OS: Win 7 64-bit, will also eventually need to run on Linux, but that's beside the point.
- Compiling with MinGW GNU GCC 4.8.1 -std=c++11
- Intel OpenCL SDK (OpenCL header, libraries, and runtime)
- According to Process Manager, Intel's OpenCL compiler is a clang variant.
- AMD APP OpenCL SDK (OpenCL header, libraries, and runtime)
- OpenCL 1.2
- I am trying to keep the source code as portable as possible.
work_group_size = 1 vs work_group_size = MAX?