
Currently I am using the C++11 async feature to create additional threads that run my computing kernel. The kernel invocations are totally independent of each other. I want to know two things.

  1. Is this computing model suitable for optimising with a GPU?
  2. If the answer to question 1 is yes, what is the basic practice for this kind of optimisation?

Pseudocode is below:

std::vector<std::future<ResultType>> futureVector;

// Launch one asynchronous task per hardware thread.
// MyClass is a placeholder name for the enclosing class that owns computingKernel.
for (unsigned int i = 0; i < std::thread::hardware_concurrency(); i++) {
    auto future = std::async(
        std::launch::async,
        &MyClass::computingKernel,
        this,
        parameter1,
        parameter2);
    futureVector.push_back(std::move(future));
}

for (std::size_t i = 0; i < futureVector.size(); i++) {
    // Wait for, and retrieve, each result
    futureVector[i].get();
}

Additional question:

  1. Is there a way to port this easily without changing the whole code? For example, some kind of program annotation that could start the threads on the GPU?
  • No, N/A, and No. CUDA programming doesn't work anything like you imagine. Commented Feb 23, 2018 at 14:37
  • @talonmies So you mean the only way to optimise it with a GPU is to rewrite this part in CUDA, right? Commented Feb 23, 2018 at 14:49
  • @talonmies I am going through OpenACC. Do you think this is a good fit for my purpose? Commented Feb 23, 2018 at 14:51
  • On your first point -- not really. The code you have shown wouldn't even exist in a CUDA implementation. What you would have is a rewritten computingKernel. Despite what you might imagine, GPUs don't run threads in anything like the way that pseudocode assumes. Commented Feb 23, 2018 at 14:58
  • @talonmies Thanks for your comment. The computingKernel actually runs 100 million times in my use case, which is why I want to use a GPU to accelerate it. At the moment it runs as a producer/consumer model on a typical multicore/SMP architecture and takes a lot of time. I will try to re-implement the computing kernel to be GPU compatible. Commented Feb 23, 2018 at 15:15

1 Answer


Is this computing model suitable for optimising with a GPU?

No. Well, mostly no.

With a GPU, you don't schedule single-thread tasks or kernels independently and explicitly wait for each to conclude. You tell the GPU to run your kernel with N threads (and N can be very large); the kernel is, of course, the same piece of code but behavior differs according to the thread index; and you wait for the execution of all threads to conclude.

Actually it's a bit more complicated (e.g. thread indices are 3-dimensional, and groupings of threads have special meaning) but that's basically it.
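
For a concrete feel, here is a minimal sketch of what a kernel looks like in CUDA; squaring a value is a stand-in for whatever your computingKernel actually does, and the names are illustrative:

// Minimal CUDA kernel sketch: every thread executes the same code,
// but selects its own work item from its block and thread indices.
__global__ void computingKernelGpu(const double* in, double* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard threads past the end
        out[i] = in[i] * in[i];                     // independent per-thread work
}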

So the GPU computing model has some similarities to yours, and some important differences.

If the answer to question 1 is yes, what is the basic practice for this kind of optimisation?

You can find a basic example of launching a CUDA kernel here (or the same program but with the official, underlying, C-style API here).
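
As a rough illustration (not a substitute for the linked examples), host code that launches the sketch kernel above via the CUDA runtime API could look like this; error checking is omitted and the names and sizes are made up:

#include <cuda_runtime.h>
#include <vector>

int main()
{
    int n = 1 << 20;                       // number of independent work items
    std::vector<double> h_in(n, 2.0), h_out(n);

    double *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(double));
    cudaMalloc(&d_out, n * sizeof(double));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    int block = 256;                       // threads per block
    int grid  = (n + block - 1) / block;   // enough blocks to cover all n items
    computingKernelGpu<<<grid, block>>>(d_in, d_out, n);

    // A cudaMemcpy on the default stream waits for the kernel to finish first.
    cudaMemcpy(h_out.data(), d_out, n * sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
}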

Note that it's possible to launch CUDA kernels asynchronously. Execution on the GPU is mostly asynchronous anyway, and the CPU threads can choose not to wait for the execution on the GPU to conclude.
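
Continuing the sketch above, that choice looks roughly like this:

// A kernel launch returns control to the host immediately; synchronize
// explicitly only when the results are actually needed.
computingKernelGpu<<<grid, block>>>(d_in, d_out, n);  // enqueue work, returns at once
// ... unrelated CPU work can run here while the GPU computes ...
cudaDeviceSynchronize();                              // block until the GPU is done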

Is there a way to port this easily without changing the whole code? For example, some kind of program annotation that could start the threads on the GPU?

No. But there is the Parallel STL initiative, which is intended to eventually be able to make use of GPUs as well. See this talk from CppCon 2017.
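
For reference, a C++17 parallel algorithm looks roughly like the following; the execution policy is the closest thing to such a "mark", but whether it actually runs on a GPU depends entirely on the toolchain (for example, NVIDIA's nvc++ with -stdpar can offload such algorithms to GPUs):

#include <algorithm>
#include <execution>
#include <vector>

int main()
{
    std::vector<double> params(1'000'000, 2.0);
    std::vector<double> results(params.size());

    // The execution policy is the only change to the call site: the same
    // code can run sequentially, on multiple cores, or (with a suitable
    // toolchain) on a GPU, without rewriting the surrounding program.
    std::transform(std::execution::par_unseq,
                   params.begin(), params.end(),
                   results.begin(),
                   [](double p) { return p * p; });   // stand-in for computingKernel
}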
