
Currently I am using the C++11 async feature to create additional threads that run my computing kernel. The kernel invocations are totally independent of each other. I want to know two things.

  1. Is this computing model suitable for optimising with a GPU?
  2. If the answer to question 1 is yes, what is the basic practice for this kind of optimisation?

Pseudocode is below:

std::vector<std::future<ResultType>> futureVector;

// Launch one asynchronous task per hardware thread.
// MyClass is a placeholder name for the enclosing class that owns computingKernel.
for (unsigned int i = 0; i < std::thread::hardware_concurrency(); i++) {
    auto future = std::async(
        std::launch::async,
        &MyClass::computingKernel,
        this,
        parameter1,
        parameter2);
    futureVector.push_back(std::move(future));
}

for (std::size_t i = 0; i < futureVector.size(); i++) {
    // Wait for, and retrieve, each result
    futureVector[i].get();
}

Additional question:

  1. Is there a way to port this easily without changing the whole code? For example, some kind of program annotation that could start the threads on the GPU?
  • No, N/A, and No. CUDA programming doesn't work anything like you imagine. Commented Feb 23, 2018 at 14:37
  • @talonmies So you mean the only way to optimise it with a GPU is to rewrite this part in CUDA, right? Commented Feb 23, 2018 at 14:49
  • @talonmies I am going through OpenACC. Do you think this is a good fit for my purpose? Commented Feb 23, 2018 at 14:51
  • On your first point -- not really. The code you have shown wouldn't even exist in a CUDA implementation. What you would have is a rewritten computingKernel. Despite what you might imagine, GPUs don't run threads in anything like the way that pseudocode assumes. Commented Feb 23, 2018 at 14:58
  • @talonmies Thanks for your comment. The computingKernel actually runs 100 million times in my use case, which is why I want to use a GPU to accelerate it. At the moment it runs as a producer/consumer model on a typical multicore/SMP architecture and takes a lot of time. I will try to re-implement the computing kernel to be GPU compatible. Commented Feb 23, 2018 at 15:15

1 Answer


Is this computing model suitable for optimising with a GPU?

No. Well, mostly no.

With a GPU, you don't schedule single-thread tasks or kernels independently and explicitly wait for each to conclude. You tell the GPU to run your kernel with N threads (and N can be very large); the kernel is, of course, the same piece of code but behavior differs according to the thread index; and you wait for the execution of all threads to conclude.

Actually it's a bit more complicated (e.g. thread indices are 3-dimensional, and groupings of threads have special meaning) but that's basically it.
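
For a concrete feel, here is a minimal sketch of what a kernel looks like in CUDA; squaring a value is a stand-in for whatever your computingKernel actually does, and the names are illustrative:

// Minimal CUDA kernel sketch: every thread executes the same code,
// but selects its own work item from its block and thread indices.
__global__ void computingKernelGpu(const double* in, double* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard threads past the end
        out[i] = in[i] * in[i];                     // independent per-thread work
}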

So the GPU computing model has some similarities to yours, and some important differences.

If the answer to question 1 is yes, what is the basic practice for this kind of optimisation?

You can find a basic example of launching a CUDA kernel here (or the same program but with the official, underlying, C-style API here).
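
As a rough illustration (not a substitute for the linked examples), host code that launches the sketch kernel above via the CUDA runtime API could look like this; error checking is omitted and the names and sizes are made up:

#include <cuda_runtime.h>
#include <vector>

int main()
{
    int n = 1 << 20;                       // number of independent work items
    std::vector<double> h_in(n, 2.0), h_out(n);

    double *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(double));
    cudaMalloc(&d_out, n * sizeof(double));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    int block = 256;                       // threads per block
    int grid  = (n + block - 1) / block;   // enough blocks to cover all n items
    computingKernelGpu<<<grid, block>>>(d_in, d_out, n);

    // A cudaMemcpy on the default stream waits for the kernel to finish first.
    cudaMemcpy(h_out.data(), d_out, n * sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
}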

Note that it's possible to launch CUDA kernels asynchronously. Execution on the GPU is mostly asynchronous anyway, and the CPU threads can choose not to wait for the execution on the GPU to conclude.
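
Continuing the sketch above, that choice looks roughly like this:

// A kernel launch returns control to the host immediately; synchronize
// explicitly only when the results are actually needed.
computingKernelGpu<<<grid, block>>>(d_in, d_out, n);  // enqueue work, returns at once
// ... unrelated CPU work can run here while the GPU computes ...
cudaDeviceSynchronize();                              // block until the GPU is done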

Is there a way to port this easily without changing the whole code? For example, some kind of program annotation that could start the threads on the GPU?

No. But there is the Parallel STL initiative, which is intended to eventually be able to make use of GPUs as well. See this talk from CppCon 2017.
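
For reference, a C++17 parallel algorithm looks roughly like the following; the execution policy is the closest thing to such a "mark", but whether it actually runs on a GPU depends entirely on the toolchain (for example, NVIDIA's nvc++ with -stdpar can offload such algorithms to GPUs):

#include <algorithm>
#include <execution>
#include <vector>

int main()
{
    std::vector<double> params(1'000'000, 2.0);
    std::vector<double> results(params.size());

    // The execution policy is the only change to the call site: the same
    // code can run sequentially, on multiple cores, or (with a suitable
    // toolchain) on a GPU, without rewriting the surrounding program.
    std::transform(std::execution::par_unseq,
                   params.begin(), params.end(),
                   results.begin(),
                   [](double p) { return p * p; });   // stand-in for computingKernel
}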
