
Just a few questions about CUDA; I apologize in advance if they seem silly.

If I declare a variable on the GPU (e.g., an array alpha with N elements, cudaMalloc((void**)&alpha, N * sizeof(double))) and allocate its value in a global function without freeing its memory, this variable should be available for other successive global functions, right?

Besides, is it possible (or advisable) to compute a scalar variable on the GPU and make it shared among several global functions on the GPU, or is it better to pass it every time as an argument from the CPU?

Thank you for your attention.

  • Thank you for all your answers. I cannot run my entire project on the GPU because it contains a recursive loop. My aim is therefore to compute some constant arrays on the GPU outside this loop, without deallocating the memory they use, and some constant scalars on the CPU. Then, inside the recursive loop, I will use global functions wherever I can. Commented Dec 29, 2012 at 20:35

3 Answers


Yes, if you write values into allocated global memory those values are persistent until you free that memory, even across kernel invocations.

As for scalar variables that act as constants, the better approach is to pass them as parameters at kernel launch rather than putting them in global memory and reading them from there. Global memory accesses are expensive, and passing the value as an argument avoids loading that scalar from global memory every time a thread needs it.
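A minimal sketch of that pattern; the kernel and variable names here are purely illustrative, not from the question:

```cuda
#include <cuda_runtime.h>

// The scalar `alpha` travels as a launch argument: every thread receives it
// through the kernel parameter area instead of fetching it from a global
// memory location it would otherwise have to be stored in first.
__global__ void scale(double *data, int n, double alpha)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= alpha;   // no extra global load just to read alpha
}

int main()
{
    const int N = 1024;
    double *d_data;
    cudaMalloc((void**)&d_data, N * sizeof(double));
    // ... fill d_data ...

    double alpha = 2.5;                                // computed on the CPU
    scale<<<(N + 255) / 256, 256>>>(d_data, N, alpha); // passed by value

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```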


1 Comment

Kernel parameters are also stored in global memory, but they get read through the constant cache. If the parameter is a pointer to an array, the array itself can also be read through the constant cache by using the const qualifier in the kernel parameter list.
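A sketch of the qualifier this comment describes; the kernel name and signature are made up for illustration:

```cuda
// Marking the input pointer const (together with __restrict__) promises the
// compiler that the array is read-only for the whole kernel, which lets it
// route those loads through the read-only/constant cache path on hardware
// that supports it.
__global__ void axpy(int n, double a,
                     const double * __restrict__ x,  // read-only: cacheable
                     double * __restrict__ y)        // written: normal path
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```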

If I understand your question correctly, you allocate an array, fill it in a global kernel function on the GPU, and then process its values in another kernel call.

As long as you do not free the allocated array, its values remain in global memory, so you can process the same array without copying it back to the CPU. Splitting the work across multiple kernel calls can come in handy when you face an execution time limit or when one of the kernel functions comes from a library. In most other cases, though, it is better to do all the work in a single kernel call.

It also seems better to pass the scalar value as an argument, because reading it from global memory carries much higher overhead.


If I declare a variable on the GPU (e.g., an array alpha with N elements, cudaMalloc((void**)&alpha, N * sizeof(double))) and allocate its value in a global function without freeing its memory, this variable should be available for other successive global functions, right?

You cannot call cudaMalloc() from a global function (kernel). It's a host function. You can use malloc() and new in kernels, but that can be inefficient.

You can use the same array in multiple kernels, for instance, you might perform multiple calculation steps with different kernels.
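For instance, a minimal sketch with made-up kernel names: the array allocated once with cudaMalloc() keeps its contents between the two launches, so the second kernel sees what the first one wrote:

```cuda
#include <cuda_runtime.h>

__global__ void init(double *alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) alpha[i] = (double)i;        // step 1: write the values
}

__global__ void square(double *alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) alpha[i] *= alpha[i];        // step 2: reuse the same memory
}

int main()
{
    const int N = 1 << 20;
    double *alpha;
    cudaMalloc((void**)&alpha, N * sizeof(double));  // allocated from host code

    init<<<(N + 255) / 256, 256>>>(alpha, N);
    square<<<(N + 255) / 256, 256>>>(alpha, N);      // values persisted

    cudaDeviceSynchronize();
    cudaFree(alpha);    // only now do the contents go away
    return 0;
}
```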

Besides, is it possible (or advisable) to compute a scalar variable on the GPU and make it shared among several global functions on the GPU, or is it better to pass it every time as an argument from the CPU?

If you pass a constant as an argument to a kernel, it is very efficiently shared among all the threads. So, it will typically be much more efficient to calculate parameters on the CPU and pass them to the kernel.

If there is a lot of parallel calculation that goes into creating the scalar, then it would be best to calculate it with a separate kernel, pass it back to the host and then pass it to the next kernel as an argument. Alternatives would just increase the complexity of your code without any performance benefit.
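A sketch of that round trip, assuming a simple single-block reduction and illustrative names: one kernel produces the scalar, an 8-byte cudaMemcpy brings it to the host, and the next kernel receives it as an ordinary argument:

```cuda
#include <cuda_runtime.h>

// Single-block shared-memory sum; fine as a sketch, not tuned for large n.
__global__ void sum_reduce(const double *in, double *out, int n)
{
    __shared__ double s[256];
    double acc = 0.0;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        acc += in[i];
    s[threadIdx.x] = acc;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) *out = s[0];
}

__global__ void normalize(double *data, int n, double total)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] /= total;
}

void pipeline(double *d_data, int n)
{
    double *d_sum, h_sum;
    cudaMalloc((void**)&d_sum, sizeof(double));

    sum_reduce<<<1, 256>>>(d_data, d_sum, n);    // compute the scalar on the GPU
    cudaMemcpy(&h_sum, d_sum, sizeof(double),
               cudaMemcpyDeviceToHost);          // tiny device-to-host copy

    normalize<<<(n + 255) / 256, 256>>>(d_data, n, h_sum);  // pass as argument
    cudaFree(d_sum);
}
```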

If there is little computation required for the scalar, then it doesn't make sense to calculate it with a kernel. Also, remember that there is no guarantee as to which order the blocks are launched in the kernel, so you would have to create a separate code path in the kernel to set up the scalar and then do costly thread index testing and synchronization to calculate the scalar and make it available to all the threads.

1 Comment

Thank you. About the use of cudaMalloc(), I didn't explain it well in my question, but I meant what you say (i.e., allocate the memory space on the GPU in the main function, and not in a global one).
