I need some advice on a project that I am going to undertake. I am planning to run simple kernels (yet to decide, but I am leaning towards embarrassingly parallel ones) on a multi-GPU node using CUDA 4.0, following the strategies listed below. The intention is to profile the node by launching kernels using the different strategies that CUDA provides in a multi-GPU environment.
- Single host thread - multiple devices (shared context) — see the sketch after this list
- Single host thread - concurrent execution of kernels on a single device (shared context)
- Multiple host threads - multiple devices, one thread per device (independent contexts)
- Single host thread - Sequential kernel execution on one device
- Multiple host threads - concurrent execution of kernels on one device (independent contexts)
- Multiple host threads - sequential execution of kernels on one device (independent contexts)
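For the first category (single host thread - multiple devices), this is roughly what I have in mind — a minimal sketch only; the kernel and problem size are placeholders for whatever embarrassingly parallel workload I settle on. My understanding is that in CUDA 4.0 a single host thread can drive every GPU just by calling `cudaSetDevice` before each launch, and since launches are asynchronous the loop queues work on all devices before synchronizing:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder embarrassingly parallel kernel.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int N = 1 << 20;
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // One stream and one buffer per device; in CUDA 4.0 cudaSetDevice
    // from a single host thread is enough -- no manual context juggling.
    cudaStream_t *streams = new cudaStream_t[deviceCount];
    float **dData = new float*[deviceCount];

    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);
        cudaStreamCreate(&streams[d]);
        cudaMalloc(&dData[d], N * sizeof(float));
        cudaMemset(dData[d], 0, N * sizeof(float));
        // Asynchronous launch: control returns immediately, so this
        // loop queues work on every GPU before any of it finishes.
        scale<<<(N + 255) / 256, 256, 0, streams[d]>>>(dData[d], N, 2.0f);
    }

    // Wait for all devices, then clean up.
    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);
        cudaStreamSynchronize(streams[d]);
        cudaStreamDestroy(streams[d]);
        cudaFree(dData[d]);
    }
    delete[] streams;
    delete[] dData;
    printf("Launched on %d device(s)\n", deviceCount);
    return 0;
}
```

The other categories would follow the same skeleton, varying whether the loop runs in one host thread or several, and whether the launches target one device or many.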
Am I missing any categories? What is your opinion of the test categories I have chosen? Any general advice w.r.t. multi-GPU programming is welcome.
Thanks,
Sayan
EDIT:
I thought that the previous categorization involved some redundancy, so I modified it.