
I am running a fitness function for 1024 matrices, each matrix gets its own block and is the same size. Each block has n*n threads (the dimension of the matrix) and needs to have n*n shared memory so that I can do an easy sum reduction. However, the dimension n for all the matrices is variable before runtime (ie. it can be manually changed, though always a power of 2 so the summation is simple). The problem here is that shared memory must be allocated using a constant, but I also need the value to pass to the kernel from the host. Where do I declare the dimension n so that it is visible to the CPU (for passing to the kernel) and can be used to declare the size of the shared memory (within the kernel)?

My code is structured like this:

from main.cu I call the kernel:

const int num_states = 1024;
const int dimension = 4;

fitness <<< num_states, dimension * dimension >>> (device_array_of_states, dimension, num_states, device_fitness_return);

and then in kernel.cu I have:

__global__ void fitness(
    int *numbers, 
    int dimension, 
    int num_states, 
    int *fitness_return) {
    __shared__ int fitness[16]; <<-- needs to be dimension * dimension
    //code
}

numbers is an array representing 1024 matrices, dimension is the row and column length, num_states is 1024, fitness_return is an array with length 1024 that holds the fitness value for each matrix. In the kernel, the shared memory is hard coded with the square of dimension (so dimension is 4 in this example).

Where and how can I declare dimension so that it can be used to allocate shared memory as well as call the kernel, this way I only have to update dimension in one place? Thanks for your help.

  • Edited my answer. Commented Dec 30, 2016 at 0:40
  • Declare it at global scope, before you use it. Commented Dec 30, 2016 at 1:33
  • template parameters are your friend in this case Commented Dec 30, 2016 at 6:12

1 Answer


The amount of dynamically allocated shared memory specified at launch is uniform across all blocks. You might end up using a different amount in each block, but the full amount is still reserved for every block. Also, shared memory is quite limited regardless (typically 48 KiB per block), so n*n elements cannot exceed that cap; for 4-byte elements (int or float), that means n*n ≤ 12288, i.e. n ≤ 110 or so.

Now, there are two ways to allocate shared memory: Static and Dynamic.

Static allocation is what you gave as an example, and it would not work here:

__shared__ int fitness[16];

In this case, the size must be known at compile time (at device-side code compile time), which is not the case for you.
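That said, as the commenters note, the question says n is only changed manually before a build, so it really is a compile-time constant. A sketch of that alternative: declare the dimension as a constexpr at global scope in a header included by both main.cu and kernel.cu (the header name here is an assumption), and static allocation stays legal:

```cuda
// config.h (hypothetical shared header)
constexpr int dimension  = 4;     // change in one place, then rebuild
constexpr int num_states = 1024;

// kernel.cu
#include "config.h"
__global__ void fitness(int *numbers, int *fitness_return) {
    // dimension is a compile-time constant, so static allocation compiles
    __shared__ int fitness[dimension * dimension];
    /* ... */
}

// main.cu
#include "config.h"
// fitness<<<num_states, dimension * dimension>>>(device_array_of_states,
//                                                device_fitness_return);
```

The trade-off is that changing n forces a recompile; dynamic allocation, below, avoids that.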

With Dynamic shared memory allocation, you don't specify the size in the kernel code - you leave it empty and prepend extern:

extern __shared__ int fitness[];

Instead, you specify the amount when launching the kernel, and the threads of the different blocks don't necessarily know what it is.

But in your case, the threads do need to know what n is. Well, just pass it as a kernel argument. So,

__global__ void fitness(
    int *numbers, 
    int dimension, 
    int num_states, 
    int *fitness_return,
    unsigned short fitness_matrix_order /* that's your n*/) 
{
    extern __shared__ int fitness[];
    /* ... etc ... */
}
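A matching launch, sketched with the question's variable names, then passes the allocation size as the third launch-configuration argument. Note that it is a size in bytes, not a number of elements:

```cuda
// Third <<<>>> argument: dynamic shared memory per block, in bytes
size_t shmem_bytes = dimension * dimension * sizeof(int);

fitness<<<num_states, dimension * dimension, shmem_bytes>>>(
    device_array_of_states, dimension, num_states,
    device_fitness_return, dimension /* fitness_matrix_order */);
```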

NVIDIA's Parallel Forall developer blog has a nice post with a more in-depth introduction to using shared memory, which specifically covers static and dynamic shared memory allocation.


8 Comments

I think maybe I misstated my question. The value of n is the same across all blocks.
If I don't declare the size of the shared memory array fitness, I get error MSB3721.
@xjtc55: I'm not sure what that means; it looks like an MSVC error (which I don't use). Also, you haven't given a (minimal) program triggering this error; that would be a separate question, I think.
@xjtc55: Can you retry? I've inserted a missing extern declaration.
Keep in mind the third kernel launch argument is the size of the dynamic shared memory in bytes, not the number of elements. Also be aware there can be only one extern __shared__ array per kernel (any additional dynamically sized shared memory arrays would alias the same memory). For more specific advice you'd need to show the full actual code.
