suppose I have the following serial C:

int add(int n, int a[n][n], int b[n][n])
{
    int i, j;

    for(i=0; i<n; i++)
    {
        for(j=0; j<n; j++)
        {
            a[i][j]+=b[i][j];
        }
    }

    return 0;
}

I think the best way to parallelise it is to realise it is a 2D problem and use 2D thread blocks, as per CUDA kernel - nested for loop

With that in mind I started writing my CUDA kernel like this:

__global__ void calc(int **A, int **B, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    if (i >= n || j >= n)
        return;

    A[i][j] += B[i][j];
}

nvcc tells me that:

./addm.cu(13): Warning: Cannot tell what pointer points to, assuming global memory space
./addm.cu(13): Warning: Cannot tell what pointer points to, assuming global memory space
./addm.cu(13): Warning: Cannot tell what pointer points to, assuming global memory space  

1) Am I correct in my philosophy? 2) I think I understand blocks, threads, etc., but I don't understand what

    int i= blockIdx.x * blockDim.x + threadIdx.x;
    int j= blockIdx.y * blockDim.y + threadIdx.y;

does

3) Is this the most efficient/fastest way of performing operations on a 2D array in general? i.e not just matrix addition it could be any "element by element" operation.

4) Will I be able to call it from MATLAB? It normally complains when the prototype is of the form type** var.

Thanks guys

  • I don't even think your code is valid C, let alone CUDA -- how is the compiler to determine the offsets if it doesn't know the dimensions of each row? Beware the difference between 2D arrays and pointers to pointers! Commented Feb 3, 2012 at 1:38
  • @harrism I was copy/pasting from multiple source files and got the wrong function prototype Commented Feb 5, 2012 at 19:43

2 Answers


The compiler warnings you are getting come from the fact that on older GPUs, the memory structure is not "flat". The compiler can't tell which memory space the addresses held by your pointer arrays are in, so it warns you that it is assuming the operation is being performed in global memory. If you compile the code for a Fermi card (sm_20 or sm_21 architecture), you won't see that warning, because on those cards the memory model is "flat" and pointers are correctly interpreted by the hardware at runtime; the compiler doesn't need to handle it at compile time.

To answer each of your questions:

  1. Yes. And no. The overall idea is about 90% right, but there are several implementation issues which will become apparent from the answers which follow.

  2. CUDA C has built-in variables that allow each thread to determine its "coordinates" in the execution grid in which it is running, and the dimensions of each block and of the grid itself. threadIdx.{xyz} provides the thread coordinates within a block, and blockIdx.{xyz} the block coordinates within the grid. blockDim.{xyz} and gridDim.{xyz} provide the dimensions of the block and the grid, respectively (note that not all hardware supports 3D grids). CUDA uses column-major order for numbering threads within each block and blocks within each grid. The calculation you are asking about computes the equivalent {i,j} coordinate in a 2D grid from the thread and block coordinates and the block size. This is discussed in some detail in the first few pages of the "Programming model" chapter of the CUDA programming guide.

  3. No, and I say that for two reasons.

    Firstly, using arrays of pointers for memory access is not a good idea in CUDA. Two levels of pointer indirection hugely increase the latency penalty of getting to your data. The key difference between a typical GPU architecture and a modern CPU architecture is the memory system. GPUs have stunningly high peak memory bandwidth but very high access latency, whereas CPUs are designed for minimal latency. So having to read and dereference two pointers to fetch a value from memory is a very big performance penalty. Store your 2D array or matrix in linear memory instead. This is what BLAS, LAPACK and Matlab do anyway.

    Secondly, every thread in your code is performing four integer arithmetic operations of setup overhead (the index calculations) for every one "productive" integer operation (the addition). There are strategies to reduce that, usually involving having each thread process more than one array element.

    If I were to write a kernel for that operation, I would do it something like the code at the bottom of this answer. It uses linear memory and a 1D grid: a fixed number of threads, chosen to properly occupy the GPU, processes the whole input array, with each thread handling many elements.

  4. No. As I mentioned earlier in my answer, Matlab uses linear memory to store matrices, not an array of pointers. This doesn't match the layout your kernel code is expecting.

Sample code:

__global__ void calc(int *A, int *B, int N)
{
    // Global thread index, and the total number of threads in the grid
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int s = blockDim.x * gridDim.x;

    // Grid-stride loop: each thread handles every s-th element,
    // so a modestly sized grid can cover an arbitrarily large N
    for( ; i<N; i+=s) {
        A[i] += B[i];
    }
}
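A host-side call for that kernel might look like the sketch below. The launch parameters (256 threads, 128 blocks) and the host arrays hA and hB are my assumptions for illustration, not part of the original answer, and error checking is omitted for brevity:

```cuda
// Hypothetical host-side driver; assumes hA and hB are host
// arrays of n ints that already hold the input data.
size_t bytes = n * sizeof(int);

int *dA, *dB;
cudaMalloc(&dA, bytes);
cudaMalloc(&dB, bytes);
cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

// Enough threads to occupy the GPU; because the kernel uses a
// grid-stride loop, the grid need not cover every element.
calc<<<128, 256>>>(dA, dB, n);

cudaMemcpy(hA, dA, bytes, cudaMemcpyDeviceToHost);
cudaFree(dA);
cudaFree(dB);
```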



I am assuming you are working with an n-by-n, row-major array. Try the following:

__global__ void calc(int *A, int *B, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    // Guard against threads that fall outside the n x n domain
    if (i < n && j < n) {
        A[i*n+j] += B[i*n+j];
    }
}
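To launch it, you would cover the n-by-n domain with 2D blocks and round the grid size up so every element gets a thread. A sketch, where the 16x16 block size is my choice rather than anything from the answer, and dA and dB are assumed to be device pointers to n*n ints:

```cuda
// Hypothetical launch configuration for the 2D kernel above.
dim3 block(16, 16);
dim3 grid((n + block.x - 1) / block.x,   // ceiling division so the
          (n + block.y - 1) / block.y);  // grid covers all of n x n
calc<<<grid, block>>>(dA, dB, n);
```

The out-of-range guard in the kernel is what makes the rounded-up grid safe when n is not a multiple of the block size.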

