
I am running a fitness function for 1024 matrices, each matrix gets its own block and is the same size. Each block has n*n threads (the dimension of the matrix) and needs to have n*n shared memory so that I can do an easy sum reduction. However, the dimension n for all the matrices is variable before runtime (ie. it can be manually changed, though always a power of 2 so the summation is simple). The problem here is that shared memory must be allocated using a constant, but I also need the value to pass to the kernel from the host. Where do I declare the dimension n so that it is visible to the CPU (for passing to the kernel) and can be used to declare the size of the shared memory (within the kernel)?

My code is structured like this:

from main.cu I call the kernel:

const int num_states = 1024;
const int dimension = 4;

fitness <<< num_states, dimension * dimension >>> (device_array_of_states, dimension, num_states, device_fitness_return);

and then in kernel.cu I have:

__global__ void fitness(
    int *numbers, 
    int dimension, 
    int num_states, 
    int *fitness_return) {
    __shared__ int fitness[16]; <<-- needs to be dimension * dimension
    //code
}

numbers is an array representing 1024 matrices, dimension is the row and column length, num_states is 1024, fitness_return is an array with length 1024 that holds the fitness value for each matrix. In the kernel, the shared memory is hard coded with the square of dimension (so dimension is 4 in this example).

Where and how can I declare dimension so that it can be used to allocate shared memory as well as call the kernel, this way I only have to update dimension in one place? Thanks for your help.

  • Edited my answer. Commented Dec 30, 2016 at 0:40
  • Declare it at global scope, before you use it. Commented Dec 30, 2016 at 1:33
  • template parameters are your friend in this case Commented Dec 30, 2016 at 6:12

1 Answer


The amount of dynamically allocated shared memory specified at launch is uniform across all blocks. You might end up using a different amount in each block, but the full amount is still reserved for every block. Also, shared memory is quite limited regardless (typically 48 KiB per block), so n*n elements cannot exceed that cap; for 4-byte elements (int or float), that means n*n ≤ 12288, i.e. n ≤ 110 or so.

Now, there are two ways to allocate shared memory: Static and Dynamic.

Static allocation is what you gave as an example, and it would not work here:

__shared__ int fitness[16];

In this case, the size must be known at compile time (at device-side code compile time), which is not the case for you.
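That said, as the commenters note, the question says n is only changed manually before a build, so it really is a compile-time constant. A sketch of that alternative: declare the dimension as a constexpr at global scope in a header included by both main.cu and kernel.cu (the header name here is an assumption), and static allocation stays legal:

```cuda
// config.h (hypothetical shared header)
constexpr int dimension  = 4;     // change in one place, then rebuild
constexpr int num_states = 1024;

// kernel.cu
#include "config.h"
__global__ void fitness(int *numbers, int *fitness_return) {
    // dimension is a compile-time constant, so static allocation compiles
    __shared__ int fitness[dimension * dimension];
    /* ... */
}

// main.cu
#include "config.h"
// fitness<<<num_states, dimension * dimension>>>(device_array_of_states,
//                                                device_fitness_return);
```

The trade-off is that changing n forces a recompile; dynamic allocation, below, avoids that.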

With Dynamic shared memory allocation, you don't specify the size in the kernel code - you leave it empty and prepend extern:

extern __shared__ int fitness[];

Instead, you specify the amount when launching the kernel, and the threads of the different blocks don't necessarily know what it is.

But in your case, the threads do need to know what n is. Well, just pass it as a kernel argument. So,

__global__ void fitness(
    int *numbers, 
    int dimension, 
    int num_states, 
    int *fitness_return,
    unsigned short fitness_matrix_order /* that's your n*/) 
{
    extern __shared__ int fitness[];
    /* ... etc ... */
}
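A matching launch, sketched with the question's variable names, then passes the allocation size as the third launch-configuration argument. Note that it is a size in bytes, not a number of elements:

```cuda
// Third <<<>>> argument: dynamic shared memory per block, in bytes
size_t shmem_bytes = dimension * dimension * sizeof(int);

fitness<<<num_states, dimension * dimension, shmem_bytes>>>(
    device_array_of_states, dimension, num_states,
    device_fitness_return, dimension /* fitness_matrix_order */);
```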

NVIDIA's Parallel Forall developer blog has a nice post with a more in-depth introduction to using shared memory, which specifically covers static and dynamic shared memory allocation.


8 Comments

I think maybe I misstated my question. The value of n is the same across all blocks.
If I don't declare the size of the shared memory array fitness, I get error MSB3721.
@xjtc55: I'm not sure what that means; it looks like an MSVC error (which I don't use). Also, you haven't given a (minimal) program triggering this error; that would be a separate question, I think.
@xjtc55: Can you retry? I've inserted a missing extern declaration.
Keep in mind the third kernel launch argument is the size of the dynamic shared memory in bytes, not the number of elements. Also be aware there can be only one extern __shared__ array per kernel (any additional dynamically sized shared memory arrays would alias the same memory). For more specific advice you'd need to show the full actual code.
