
I have a large number of threads running, each performing a small matrix multiplication. All the small matrices have already been loaded into global memory. I would like to improve performance by having each thread load its small matrices into shared memory and then compute the product. The problem is that I do not know the sizes of the matrices at compile time, so I cannot declare variables as in __shared__ double mat1[XSIZE][YSIZE]. On a PC, I would simply allocate the memory dynamically, but I do not know whether that can be done for shared memory. If calling malloc in a kernel (assuming such a call is possible) allocates only in global memory, that does not help either.

Is there a way to declare arrays at runtime inside a kernel? Is there any other way to resolve this problem?

1 Answer


You can declare a dynamically sized shared memory allocation in CUDA, like this:

__global__ void kernel()
{
    extern __shared__ double mat1[];
}

Note the empty brackets: the declaration must be an unsized array, not a pointer. You then launch your kernel like this:

kernel<<<grid,block,XSIZE*YSIZE*sizeof(double)>>>();

This is discussed in more detail in the CUDA programming guide.
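To make the mapping concrete, here is a minimal sketch of how a block might use such an allocation. The kernel name, the argument layout, and the one-matrix-pair-per-block indexing are assumptions for illustration, not part of the original answer:

__global__ void multiply(const double *a, const double *b,
                         int rows, int cols)
{
    // One dynamically sized allocation per block; its byte count is
    // set by the third launch parameter.
    extern __shared__ double smem[];

    // Carve the single allocation into two tiles (illustrative layout).
    double *tileA = smem;
    double *tileB = smem + rows * cols;

    // Cooperative load from global to shared memory.
    for (int i = threadIdx.x; i < rows * cols; i += blockDim.x) {
        tileA[i] = a[blockIdx.x * rows * cols + i];
        tileB[i] = b[blockIdx.x * rows * cols + i];
    }
    __syncthreads();

    // ... compute the product from tileA and tileB ...
}

// Host side: reserve room for both tiles.
// multiply<<<grid, block, 2 * rows * cols * sizeof(double)>>>(a, b, rows, cols);

Because only one extern __shared__ array can be declared per kernel, carving it up with pointer arithmetic as above is the usual way to get several logical arrays out of the single allocation.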


2 Comments

This method dynamically allocates the same amount of memory to each thread. I have to populate each thread with differently sized matrices, and I do not yet know the upper and lower bounds on their sizes. But thank you very much for the reply and the reference; it is a good starting point. Yes, it is discussed in section B.16 of the programming guide, as I found out from your hint.
No, it allocates shared memory to each block dynamically. Shared memory has block scope in CUDA, not thread scope.
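If the matrices differ in size from block to block, one common workaround (a sketch only; the per-block size arrays rowsPerBlock/colsPerBlock are assumed to have been prepared on the host) is to reserve the worst-case size at launch and let each block use only the portion it needs:

__global__ void kernel(const int *rowsPerBlock, const int *colsPerBlock)
{
    // Sized for the largest matrix at launch time; per-block scope.
    extern __shared__ double mat1[];

    int rows = rowsPerBlock[blockIdx.x];
    int cols = colsPerBlock[blockIdx.x];

    // Each block touches only rows*cols entries of the allocation;
    // the remainder of the worst-case reservation is simply unused.
    for (int i = threadIdx.x; i < rows * cols; i += blockDim.x)
        mat1[i] = 0.0;  // placeholder for the real load from global memory
    __syncthreads();
}

// Host: maxRows*maxCols must bound every block's matrix.
// kernel<<<grid, block, maxRows * maxCols * sizeof(double)>>>(rowsPerBlock, colsPerBlock);

The third launch parameter is fixed for the whole grid, so over-reserving to the maximum is the price of varying sizes; the trade-off is wasted shared memory, which can reduce occupancy.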
