Suppose I have the following serial C:
int add(int **a, int **b, int n)
{
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < n; j++)
        {
            a[i][j] += b[i][j];
        }
    }
    return 0;
}
I think the best way to parallelise it is to realise it is a 2D problem and use 2D thread blocks, as per CUDA kernel - nested for loop. With that in mind, I started writing my CUDA kernel like this:
__global__ void calc(int **A, int **B, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= n || j >= n)
        return;
    A[i][j] += B[i][j];
}
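For reference, this is roughly how I'm planning to launch it (the 16x16 block size is just a guess on my part, and A, B are assumed to already be device pointers):

```cuda
// Hypothetical launch code -- block size of 16x16 is an arbitrary choice.
// Assumes A and B have already been allocated and copied to the device.
int threads = 16;
dim3 blockDim(threads, threads);
// Round up so the grid covers all n x n elements even when n is not
// a multiple of the block size; the kernel's bounds check handles the rest.
dim3 gridDim((n + threads - 1) / threads, (n + threads - 1) / threads);
calc<<<gridDim, blockDim>>>(A, B, n);
cudaDeviceSynchronize();
```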
nvcc tells me that:
./addm.cu(13): Warning: Cannot tell what pointer points to, assuming global memory space
./addm.cu(13): Warning: Cannot tell what pointer points to, assuming global memory space
./addm.cu(13): Warning: Cannot tell what pointer points to, assuming global memory space
1) Am I correct with my philosophy? 2) I think I understand blocks, threads, etc., but I don't understand what
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
does.
3) Is this the most efficient/fastest way of performing operations on a 2D array in general? i.e. not just matrix addition -- it could be any "element by element" operation.
4) Will I be able to call it from MATLAB? It normally freaks out when the prototype is of the form type** var.
Thanks guys