I need some advice on a project that I am going to undertake. I am planning to run simple kernels (yet to decide, but I am leaning towards embarrassingly parallel ones) on a multi-GPU node using CUDA 4.0, following the strategies listed below. The intention is to profile the node by launching kernels using the different strategies that CUDA provides in a multi-GPU environment.
- Single host thread - multiple devices (shared context) — see the sketch after this list
- Single host thread - concurrent execution of kernels on a single device (shared context)
- Multiple host threads - multiple devices, one thread per device (independent contexts)
- Single host thread - Sequential kernel execution on one device
- Multiple host threads - concurrent execution of kernels on one device (independent contexts)
- Multiple host threads - sequential execution of kernels on one device (independent contexts)
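For the first category (single host thread - multiple devices), this is roughly what I have in mind — a minimal sketch only; the kernel and problem size are placeholders for whatever embarrassingly parallel workload I settle on. My understanding is that in CUDA 4.0 a single host thread can drive every GPU just by calling `cudaSetDevice` before each launch, and since launches are asynchronous the loop queues work on all devices before synchronizing:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder embarrassingly parallel kernel.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int N = 1 << 20;
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // One stream and one buffer per device; in CUDA 4.0 cudaSetDevice
    // from a single host thread is enough -- no manual context juggling.
    cudaStream_t *streams = new cudaStream_t[deviceCount];
    float **dData = new float*[deviceCount];

    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);
        cudaStreamCreate(&streams[d]);
        cudaMalloc(&dData[d], N * sizeof(float));
        cudaMemset(dData[d], 0, N * sizeof(float));
        // Asynchronous launch: control returns immediately, so this
        // loop queues work on every GPU before any of it finishes.
        scale<<<(N + 255) / 256, 256, 0, streams[d]>>>(dData[d], N, 2.0f);
    }

    // Wait for all devices, then clean up.
    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);
        cudaStreamSynchronize(streams[d]);
        cudaStreamDestroy(streams[d]);
        cudaFree(dData[d]);
    }
    delete[] streams;
    delete[] dData;
    printf("Launched on %d device(s)\n", deviceCount);
    return 0;
}
```

The other categories would follow the same skeleton, varying whether the loop runs in one host thread or several, and whether the launches target one device or many.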
Am I missing any categories? What is your opinion of the test categories I have chosen? Any general advice w.r.t. multi-GPU programming is welcome.
Thanks,
Sayan
EDIT:
I thought that the previous categorization involved some redundancy, so I modified it.