
I wrote a program that needs to call a function a very large number of times with different input parameters. To speed things up, I multithreaded it like this:

std::vector< MTDPDS* > mtdpds_list;
boost::thread_group thread_gp;
for (size_t feat_index = 0; feat_index < feat_parser.getNumberOfFeat(); ++feat_index)
{
    Feat* feat = feat_parser.getFeat(static_cast<unsigned int>(feat_index));

    // != 0 has been added to avoid a warning message during compilation
    bool rotatedFeat = (feat->flag & 0x00000020) != 0;
    if (!rotatedFeat)
    {
        Desc* desc = new Desc(total_sb, ob.size());

        MTDPDS* processing_data = new MTDPDS();
        processing_data->feat = feat;
        processing_data->desc = desc;
        processing_data->img_info = image_info;
        processing_data->data_op = &data_operations;
        processing_data->vecs_bb = vecs_bb;

        mtdpds_list.push_back(processing_data);

        thread_gp.add_thread(new boost::thread(compute_desc, processing_data));
    }
}

// Wait for all threads to complete
thread_gp.join_all();

This is a fragment of a much larger code base, so don't worry too much about variable names. The important point is that I create one MTDPDS object per thread to hold its input and output parameters, spawn a thread that calls my processing function compute_desc, and wait for all threads to complete before continuing.

However, my for loop has more than 2000 iterations, which means I start more than 2000 threads. I run the code on a cluster, so it is fairly fast, but it still takes too long in my opinion.

I would like to move this part to the GPU, since it has many more cores, but I am new to GPU programming.

  1. Since I already have a separate computing function, is there a way to move it to the GPU easily without changing the whole code? For example, is there a function that starts threads on the GPU in a similar way to boost (essentially replacing the boost thread with a GPU thread)?
  2. My computing function accesses data loaded in memory (RAM here). Does the GPU require this data to be loaded into GPU memory, or can it access RAM directly (and in that case, which one is faster)?
  3. One last question (though I'm pretty sure I know the answer): is it possible to make the code hardware independent, so that it could run on Nvidia, ATI, etc.?

Thank you.

  • If you need cross-hardware support with multithreading, your best bet is probably learning Vulkan. It's a lot more efficient than OpenGL and supports both Nvidia and AMD hardware, as opposed to CUDA. Commented Jul 21, 2017 at 16:59
  • From what I've seen, Vulkan is oriented much more toward graphics than computation, isn't it? Commented Jul 21, 2017 at 17:21
  • Yes, you're correct. Maybe try looking into OpenCL? Commented Jul 21, 2017 at 19:15

1 Answer

  • 1) The simplest solution is to use OpenACC #pragma directives, which should already be supported in GCC 7.

  • 2) Your data should be GPU-friendly: look into the Structure of Arrays (SoA) layout.

  • 3) Your compute_desc "kernel" should be GPU-compliant. If you are not sure what that means, let's say it should be vectorizable by the compiler.
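To illustrate point 2, here is a minimal sketch of the difference between an Array of Structures and a Structure of Arrays. The names `PointAoS`, `PointsSoA`, `to_soa` and `sum_x` are hypothetical and do not come from the question's code; they only show the layout idea.

```cpp
#include <cstddef>
#include <vector>

// Array of Structures (AoS): the fields of one element sit next to each
// other, so reading only `x` for every element strides through memory.
// (Hypothetical type, for illustration only.)
struct PointAoS {
    float x, y, z;
};

// Structure of Arrays (SoA): each field lives in its own contiguous array,
// so neighbouring loop iterations touch neighbouring addresses.
struct PointsSoA {
    std::vector<float> x, y, z;
    explicit PointsSoA(std::size_t n) : x(n), y(n), z(n) {}
    std::size_t size() const { return x.size(); }
};

// Convert an AoS container into the SoA layout.
PointsSoA to_soa(const std::vector<PointAoS>& in) {
    PointsSoA out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out.x[i] = in[i].x;
        out.y[i] = in[i].y;
        out.z[i] = in[i].z;
    }
    return out;
}

// Example of a loop that only needs one field: with SoA it reads a single
// contiguous array.
float sum_x(const PointsSoA& p) {
    float total = 0.0f;
    for (std::size_t i = 0; i < p.size(); ++i) {
        total += p.x[i];
    }
    return total;
}
```

Contiguous per-field arrays are generally easier for a compiler to vectorize and easier to copy to a device in one piece, which is why this layout is usually recommended for GPU code.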

I hope this helps a bit. I think a short OpenACC tutorial would be the best starting point for you; CUDA/OpenCL can come later. My 2 cents.
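As a rough idea of what the directive approach looks like, here is a minimal sketch under stated assumptions. The function `scale_all` is a hypothetical stand-in for the per-feature work, not the real compute_desc. With GCC, OpenACC is enabled with `-fopenacc`, and actual GPU offloading needs a compiler build with offload support; without the flag the pragma is ignored and the loop simply runs on the CPU.

```cpp
#include <vector>

// Hypothetical stand-in for the per-element work (not the real compute_desc).
void scale_all(const float* in, float* out, int n, float factor) {
    // copyin:  transfer `in` from host RAM to device memory before the loop.
    // copyout: transfer `out` back from device memory to host RAM afterwards.
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; ++i) {
        out[i] = in[i] * factor;  // each iteration is independent
    }
}
```

The data clauses also relate to the question about memory: in this sketch the arrays are copied to the device before the loop and copied back afterwards, so keeping those transfers few and large usually matters for performance.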


1 Comment

Thank you for the tutorial. I will have to dig into the documentation though, because in my case the loop calls my processing function, which itself contains loops that call other functions. I think OpenACC 2.0 has something called routine that I could use, but I haven't figured out how to use it yet.
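For reference, the routine directive mentioned in this comment could be used roughly as follows. This is only a sketch: `sum_of_squares` and `process_rows` are hypothetical names, and the real processing function would need every function it calls to be marked in the same way.

```cpp
#include <vector>

// `routine seq` asks the compiler to also build a device version of this
// function, to be executed sequentially by whichever thread calls it.
// (Hypothetical helper, for illustration only.)
#pragma acc routine seq
float sum_of_squares(const float* row, int len) {
    float total = 0.0f;
    for (int j = 0; j < len; ++j) {
        total += row[j] * row[j];
    }
    return total;
}

// The outer loop is parallelized; each iteration calls the routine above
// on its own row of a flat, row-major array.
void process_rows(const float* data, float* result, int rows, int len) {
    #pragma acc parallel loop copyin(data[0:rows*len]) copyout(result[0:rows])
    for (int i = 0; i < rows; ++i) {
        result[i] = sum_of_squares(data + i * len, len);
    }
}
```

Compiled without OpenACC support, both pragmas are ignored and the code runs as ordinary sequential C++.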
