CUDA device function that returns an array

Question

I'm trying to find a solution for the following: a CUDA kernel uses several device functions, and some of them need to return an array.

I tried this:

__device__ float *MatProd2dWxC(float *a2d, float  *b2d, int mGl, int nGl)
{
    int aRows = mGl; int aCols = nGl;
    int bRows = nGl; int bCols = 1;

    float *result;
    //result.resize(mGl*aRows);

    for (int i = 0; i < aRows; ++i) // each row of a
        for (int j = 0; j < bCols; ++j) // each col of b
            for (int k = 0; k < aCols; ++k) 
                result[i*mGl + j] += a2d[i*mGl + k] * b2d[k*mGl + j];

    return result;
}

This doesn't compile, and I understand that returning a pointer from the function may not be a good idea. What is the right way to do it? One idea is to pass in an additional temporary array and change the function to void, but I need to call it many times in the kernel code, so I'm looking for a more elegant solution.

tera · Accepted Answer · 2018-07-02 12:23:23Z

Returning a pointer from a device function is ok and works just fine.

The problem in your code is that you never assign a value to the result pointer, which you then dereference and later return from the function. You need to allocate memory with float *result = (float *)malloc(mGl*aRows * sizeof(float)); (the cast is required because CUDA code is compiled as C++), and don't forget to free() it later!
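As a minimal sketch of the in-kernel allocation described above (not the asker's exact code): the names matVecAlloc, matVecAllocKernel and runMatVecAlloc are hypothetical, the matrix is assumed to be row-major, and the right-hand side is a single column as in the question. Device-side malloc() draws from a limited device heap, so the sketch checks the result for a null pointer.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: computes y = A * x for a row-major rows x cols matrix A
// and returns y in memory taken from the device heap. The caller must free() it.
__device__ float *matVecAlloc(const float *a, const float *x, int rows, int cols)
{
    // CUDA is compiled as C++, so the result of malloc() needs a cast.
    float *result = (float *)malloc(rows * sizeof(float));
    if (result == nullptr)
        return nullptr;                 // device heap exhausted

    for (int i = 0; i < rows; ++i) {
        float sum = 0.0f;               // malloc() does not zero the memory
        for (int k = 0; k < cols; ++k)
            sum += a[i * cols + k] * x[k];
        result[i] = sum;
    }
    return result;
}

// Demo kernel: a single thread computes the product and copies it to out.
__global__ void matVecAllocKernel(const float *a, const float *x, float *out,
                                  int rows, int cols)
{
    if (blockIdx.x != 0 || threadIdx.x != 0)
        return;
    float *y = matVecAlloc(a, x, rows, cols);
    if (y == nullptr)
        return;
    for (int i = 0; i < rows; ++i)
        out[i] = y[i];
    free(y);                            // release the device-heap allocation
}

// Hypothetical host wrapper: returns true if every CUDA call succeeded.
bool runMatVecAlloc(const float *hA, const float *hX, float *hY, int rows, int cols)
{
    float *dA = nullptr, *dX = nullptr, *dY = nullptr;
    size_t aBytes = sizeof(float) * rows * cols;
    size_t xBytes = sizeof(float) * cols;
    size_t yBytes = sizeof(float) * rows;
    bool ok = cudaMalloc(&dA, aBytes) == cudaSuccess
           && cudaMalloc(&dX, xBytes) == cudaSuccess
           && cudaMalloc(&dY, yBytes) == cudaSuccess
           && cudaMemcpy(dA, hA, aBytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dX, hX, xBytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        matVecAllocKernel<<<1, 1>>>(dA, dX, dY, rows, cols);
        ok = cudaDeviceSynchronize() == cudaSuccess
          && cudaMemcpy(hY, dY, yBytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA);
    cudaFree(dX);
    cudaFree(dY);
    return ok;
}
```

Every call to matVecAlloc must be paired with a free() in device code, which is exactly the ownership bookkeeping that becomes awkward when such calls are nested inside a larger expression.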

However, a better design would be to pass an already allocated pointer into your device function. This establishes clear ownership of the allocation (i.e. makes it clear in your code where free() should be called), and may avoid unnecessary allocations in some cases, e.g. where the allocation can be pulled outside of a loop.

This problem has nothing to do with CUDA; it applies equally to standard C.
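A minimal sketch of that design, under some assumptions: the names matVecOut, batchedMatVecKernel and runBatchedMatVec are hypothetical, the matrix is row-major, and each thread is assumed to multiply one shared matrix by its own input vector. The output buffer is allocated once on the host side, and each thread receives a pointer to its own slice, so results from different threads cannot collide.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: writes y = A * x into caller-provided storage.
// A is row-major with rows x cols elements; y must hold rows floats.
__device__ void matVecOut(const float *a, const float *x, float *y, int rows, int cols)
{
    for (int i = 0; i < rows; ++i) {
        float sum = 0.0f;
        for (int k = 0; k < cols; ++k)
            sum += a[i * cols + k] * x[k];
        y[i] = sum;
    }
}

// Thread t multiplies A by its own vector xs[t] and writes into its own
// slice of ys, which was allocated once before the launch.
__global__ void batchedMatVecKernel(const float *a, const float *xs, float *ys,
                                    int rows, int cols, int batch)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= batch)
        return;
    matVecOut(a, xs + t * cols, ys + t * rows, rows, cols);
}

// Hypothetical host wrapper: returns true if every CUDA call succeeded.
bool runBatchedMatVec(const float *hA, const float *hXs, float *hYs,
                      int rows, int cols, int batch)
{
    float *dA = nullptr, *dXs = nullptr, *dYs = nullptr;
    size_t aBytes = sizeof(float) * rows * cols;
    size_t xBytes = sizeof(float) * cols * batch;
    size_t yBytes = sizeof(float) * rows * batch;
    bool ok = cudaMalloc(&dA, aBytes) == cudaSuccess
           && cudaMalloc(&dXs, xBytes) == cudaSuccess
           && cudaMalloc(&dYs, yBytes) == cudaSuccess
           && cudaMemcpy(dA, hA, aBytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dXs, hXs, xBytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 128;
        int blocks = (batch + threads - 1) / threads;
        batchedMatVecKernel<<<blocks, threads>>>(dA, dXs, dYs, rows, cols, batch);
        ok = cudaDeviceSynchronize() == cudaSuccess
          && cudaMemcpy(hYs, dYs, yBytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA);
    cudaFree(dXs);
    cudaFree(dYs);
    return ok;
}
```

Here the single cudaMalloc/cudaFree pair on the host owns the output storage, and the device function itself never allocates anything.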

5 Comments

Geo Evclid Over a year ago

Thanks for the answer. This function will be used in every thread, so each thread needs its own independent result. Is it possible to allocate it from the __device__ side rather than from the __host__ side, to avoid collisions?

Geo Evclid Over a year ago

To be more specific, the kernel needs to calculate the following formula: ft2d = MatSig2d(MatSum2d(MatProd2dWx(Wf2d, xt2d, m, n), MatProd2dUh(Uf2d, h_prev2d, m, n), bf2d)); so the idea is to use the result directly in the formula.

Geo Evclid Over a year ago

I think you are right: make a 3D array, i.e. 2D plus one additional dimension for the threads.

tera Over a year ago

If you pass the pointer into the function (rather than malloc()ing it inside and returning it), then you can use an automatic array, which gets allocated on the stack. This is much more efficient than using an extra dimension for the thread ID, because stack memory is only needed for the threads that are running concurrently, while your 3D array allocates memory for every thread. Of course, allocating on the stack limits you to fairly small arrays. You might need to increase the stack size in that case.

Geo Evclid Over a year ago

I need an array of about 2500 elements (a 50x50 2D array). Is this suitable for the stack? And could you explain in a bit more detail how to pass the pointer into the function, given that I need to send the array size through the kernel to the function?
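The stack-based approach from the comments can be sketched roughly as follows. This is an illustration under assumptions, not code from the thread: the names kMaxRows, matVecLocal, rowSumKernel and runRowSum are hypothetical, the compile-time bound of 64 rows is arbitrary, and the 4096-byte stack limit is only an example value. The runtime size is passed through the kernel arguments to the device function, while the storage itself is a fixed-size automatic array. A 50x50 float array would need 2500 * 4 = 10,000 bytes per thread, which may exceed the default per-thread stack limit.

```cuda
#include <cuda_runtime.h>

constexpr int kMaxRows = 64;   // assumed compile-time upper bound on rows

// Hypothetical helper: writes y = A * x into caller-provided storage.
__device__ void matVecLocal(const float *a, const float *x, float *y, int rows, int cols)
{
    for (int i = 0; i < rows; ++i) {
        float sum = 0.0f;
        for (int k = 0; k < cols; ++k)
            sum += a[i * cols + k] * x[k];
        y[i] = sum;
    }
}

// Each thread keeps its intermediate vector in an automatic array and passes
// a pointer to it into the device function. out[t] receives the sum of the
// elements of A * xs[t], just to give the temporary a use.
__global__ void rowSumKernel(const float *a, const float *xs, float *out,
                             int rows, int cols, int batch)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= batch || rows > kMaxRows)
        return;
    float tmp[kMaxRows];               // automatic array, private to this thread
    matVecLocal(a, xs + t * cols, tmp, rows, cols);
    float total = 0.0f;
    for (int i = 0; i < rows; ++i)
        total += tmp[i];
    out[t] = total;
}

// Hypothetical host wrapper: returns true if every CUDA call succeeded.
bool runRowSum(const float *hA, const float *hXs, float *hOut,
               int rows, int cols, int batch)
{
    if (rows > kMaxRows)
        return false;
    float *dA = nullptr, *dXs = nullptr, *dOut = nullptr;
    size_t aBytes = sizeof(float) * rows * cols;
    size_t xBytes = sizeof(float) * cols * batch;
    size_t oBytes = sizeof(float) * batch;
    // Example value only: raise the per-thread stack limit in case the
    // default is too small for the automatic array.
    bool ok = cudaDeviceSetLimit(cudaLimitStackSize, 4096) == cudaSuccess
           && cudaMalloc(&dA, aBytes) == cudaSuccess
           && cudaMalloc(&dXs, xBytes) == cudaSuccess
           && cudaMalloc(&dOut, oBytes) == cudaSuccess
           && cudaMemcpy(dA, hA, aBytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dXs, hXs, xBytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 128;
        int blocks = (batch + threads - 1) / threads;
        rowSumKernel<<<blocks, threads>>>(dA, dXs, dOut, rows, cols, batch);
        ok = cudaDeviceSynchronize() == cudaSuccess
          && cudaMemcpy(hOut, dOut, oBytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA);
    cudaFree(dXs);
    cudaFree(dOut);
    return ok;
}
```

Because tmp lives in per-thread local storage, nothing has to be freed, and intermediate results of several such calls can be chained by passing each temporary array on to the next device function.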
