I am looking for a solution to this problem: a CUDA kernel uses several device functions, and some of them need to return an array.

I tried this:

__device__ float *MatProd2dWxC(float *a2d, float  *b2d, int mGl, int nGl)
{
    int aRows = mGl; int aCols = nGl;
    int bRows = nGl; int bCols = 1;

    float *result;
    //result.resize(mGl*aRows);

    for (int i = 0; i < aRows; ++i) // each row of a
        for (int j = 0; j < bCols; ++j) // each col of b
            for (int k = 0; k < aCols; ++k) 
                result[i*mGl + j] += a2d[i*mGl + k] * b2d[k*mGl + j];

    return result;
}

I have not compiled this, because I understand that a function returning a pointer is probably not a good idea. What is the right way to do it? One idea is to add a temporary array and change the function to return void, but I need to call it many times in the kernel code, so I am looking for a more elegant solution.

1 Answer

Returning a pointer from a device function is ok and works just fine.

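The problem in your code is that you never assign a value to the result pointer, yet you dereference it and later return it from the function. You need to allocate memory first, e.g. float *result = (float *)malloc(mGl*aRows * sizeof(float)); (the cast is needed because CUDA device code is compiled as C++), and don't forget to free() it later! Because the loop accumulates with +=, the memory must also be zeroed before use, since malloc() does not initialize it.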
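A minimal sketch of the in-kernel allocation described above, for illustration only. The names matVecAlloc, matVecAllocKernel and runMatVecAlloc are hypothetical, and the layout is an assumption: a row-major rows x cols matrix multiplied by a cols-element vector, which may not match the indexing the question intends. Device-side malloc() takes memory from a limited device heap (its size can be adjusted with cudaDeviceSetLimit and cudaLimitMallocHeapSize), so the sketch checks for a null pointer.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper (not from the question): allocates and returns y = A * x.
// A is assumed row-major, rows x cols; x has cols elements. Caller must free().
__device__ float *matVecAlloc(const float *a, const float *x, int rows, int cols)
{
    // In-kernel malloc() takes memory from the device heap and can return nullptr.
    float *y = static_cast<float *>(malloc(rows * sizeof(float)));
    if (y == nullptr) return nullptr;

    for (int i = 0; i < rows; ++i) {
        float sum = 0.0f;  // accumulate locally; malloc() does not zero memory
        for (int k = 0; k < cols; ++k)
            sum += a[i * cols + k] * x[k];
        y[i] = sum;
    }
    return y;
}

// Single-thread kernel: calls the helper, copies the result out, frees it.
__global__ void matVecAllocKernel(const float *a, const float *x, float *out,
                                  int rows, int cols)
{
    float *y = matVecAlloc(a, x, rows, cols);
    if (y == nullptr) return;  // allocation failed; out keeps its zeros
    for (int i = 0; i < rows; ++i) out[i] = y[i];
    free(y);  // release the buffer once the result has been consumed
}

// Hypothetical host wrapper: runs the kernel with one thread and returns y.
std::vector<float> runMatVecAlloc(const std::vector<float> &a,
                                  const std::vector<float> &x, int rows, int cols)
{
    float *dA = nullptr, *dX = nullptr, *dOut = nullptr;
    cudaMalloc(&dA, a.size() * sizeof(float));
    cudaMalloc(&dX, x.size() * sizeof(float));
    cudaMalloc(&dOut, rows * sizeof(float));
    cudaMemcpy(dA, a.data(), a.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dX, x.data(), x.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, rows * sizeof(float));

    matVecAllocKernel<<<1, 1>>>(dA, dX, dOut, rows, cols);

    std::vector<float> out(rows);
    cudaMemcpy(out.data(), dOut, rows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dX); cudaFree(dOut);
    return out;
}
```

For example, runMatVecAlloc({1, 2, 3, 4}, {1, 1}, 2, 2) multiplies the 2x2 matrix [[1, 2], [3, 4]] by the vector (1, 1). Error checking on the runtime calls is omitted to keep the sketch short.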

However, a better design would be to pass an already allocated pointer into your device function. This establishes clear ownership of the allocation (i.e. makes it clear in your code where free() should be called), and may avoid unnecessary allocations in some cases, e.g. where the allocation can be pulled outside of a loop.
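The caller-owned design could look roughly like the sketch below. It is an illustration under assumptions, not the asker's actual code: matVecInto, batchedMatVec and runBatchedMatVec are hypothetical names, the matrix is assumed row-major, and each thread is assumed to process its own input vector and write into its own slice of a buffer allocated once on the host.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: writes y = A * x into caller-owned storage 'out'.
// A is assumed row-major, rows x cols; 'out' must hold rows elements.
__device__ void matVecInto(const float *a, const float *x, float *out,
                           int rows, int cols)
{
    for (int i = 0; i < rows; ++i) {
        float sum = 0.0f;
        for (int k = 0; k < cols; ++k)
            sum += a[i * cols + k] * x[k];
        out[i] = sum;
    }
}

// Thread t multiplies A by its own vector xs[t] and writes into its own slice
// ys[t], so no two threads ever touch the same output elements.
__global__ void batchedMatVec(const float *a, const float *xs, float *ys,
                              int rows, int cols, int batch)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= batch) return;
    matVecInto(a, xs + t * cols, ys + t * rows, rows, cols);
}

// Hypothetical host wrapper: the output buffer is allocated and freed here,
// in one place, rather than inside the device function.
std::vector<float> runBatchedMatVec(const std::vector<float> &a,
                                    const std::vector<float> &xs,
                                    int rows, int cols, int batch)
{
    float *dA = nullptr, *dXs = nullptr, *dYs = nullptr;
    size_t outBytes = static_cast<size_t>(batch) * rows * sizeof(float);
    cudaMalloc(&dA, a.size() * sizeof(float));
    cudaMalloc(&dXs, xs.size() * sizeof(float));
    cudaMalloc(&dYs, outBytes);
    cudaMemcpy(dA, a.data(), a.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dXs, xs.data(), xs.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dYs, 0, outBytes);

    int block = 64;
    int grid = (batch + block - 1) / block;
    batchedMatVec<<<grid, block>>>(dA, dXs, dYs, rows, cols, batch);

    std::vector<float> ys(static_cast<size_t>(batch) * rows);
    cudaMemcpy(ys.data(), dYs, outBytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dXs); cudaFree(dYs);
    return ys;
}
```

Because the device function no longer allocates anything, it can be called repeatedly with the same buffer, and the only cudaFree() calls sit next to the matching cudaMalloc() calls.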

This problem has nothing to do with CUDA; it applies equally to standard C.

5 Comments

Thanks for the answer. This function will be used in every thread, so each thread needs its own independent result. Is it possible to allocate it on the __device__ side rather than from the __host__ side, to avoid collisions?
To be more specific, the kernel needs to compute the following formula: ft2d = MatSig2d(MatSum2d(MatProd2dWx(Wf2d, xt2d,m,n), MatProd2dUh(Uf2d, h_prev2d.m,n), bf2d)); so the idea is to use the result directly in the formula.
I think you are right: make a 3D array, i.e. 2D plus one additional dimension for the threads.
If you pass the pointer into the function (rather than malloc()ing it inside and returning it), then you can use an automatic array, which gets allocated on the stack. This is much more efficient than using an extra dimension for the thread ID, because stack memory is only needed for the threads that are running concurrently, while your 3D array allocates memory for all threads. Of course, allocating on the stack limits you to fairly small arrays. You might need to increase the stack size in that case.
I need an array of about 2500 elements, i.e. a 50x50 2D array. Is that suitable for the stack? Also, could you explain in a bit more detail how to pass the pointer into the function, given that I need to pass the array size through the kernel to the function?
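
The automatic-array suggestion from the comments might be sketched as follows. This is an illustration under stated assumptions rather than a tested recommendation: kMaxElems, fillBuffer, sumBuffer, scratchKernel and runScratch are hypothetical names, the compile-time bound of 2500 floats (about 10 KB per thread) comes from the last comment, and the 1024 bytes of headroom is a guess. Whether the stack limit really has to be raised may depend on the compiler and on how the array is used, so the sketch queries the current limit and raises it only if it is below the assumed requirement.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kMaxElems = 2500;  // assumed upper bound, e.g. a 50x50 matrix

// Hypothetical device helpers: each takes the buffer pointer and its size.
__device__ void fillBuffer(float *buf, int n, float value)
{
    for (int i = 0; i < n; ++i) buf[i] = value;
}

__device__ float sumBuffer(const float *buf, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += buf[i];
    return s;
}

// The runtime size n is passed through the kernel to the device functions;
// the automatic array itself needs a compile-time size, hence kMaxElems.
__global__ void scratchKernel(float *out, int n, int numThreads)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numThreads || n > kMaxElems) return;

    float scratch[kMaxElems];  // automatic array, private to this thread
    fillBuffer(scratch, n, static_cast<float>(t + 1));
    out[t] = sumBuffer(scratch, n);
}

// Hypothetical host wrapper: adjusts the per-thread stack limit if needed,
// launches the kernel, and returns one sum per thread.
std::vector<float> runScratch(int n, int numThreads)
{
    size_t needed = kMaxElems * sizeof(float) + 1024;  // array plus guessed headroom
    size_t current = 0;
    cudaDeviceGetLimit(&current, cudaLimitStackSize);
    if (current < needed) cudaDeviceSetLimit(cudaLimitStackSize, needed);

    float *dOut = nullptr;
    cudaMalloc(&dOut, numThreads * sizeof(float));
    cudaMemset(dOut, 0, numThreads * sizeof(float));

    int block = 64;
    int grid = (numThreads + block - 1) / block;
    scratchKernel<<<grid, block>>>(dOut, n, numThreads);

    std::vector<float> out(numThreads);
    cudaMemcpy(out.data(), dOut, numThreads * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return out;
}
```

Thread t fills its private buffer with the value t + 1 and sums it, so with n = 2500 thread 0 produces 2500 and thread 3 produces 10000. A per-thread array of this size is large for the stack, so it is worth measuring this against a buffer in global memory before committing to it.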
