I want to read a text file and store it in an array. Then I want to transfer the array from the host to the device and store it in shared memory. I have written the following code, but the execution time has increased compared with using global memory. I cannot understand why. Also, it would be great if someone could help me write this code using constant memory.

__global__ void deviceFunction(char *pBuffer,int pSize){
    extern __shared__ char p[];
    int i;
    for(i=0;i<pSize;i++){
        p[i] = pBuffer[i];
    }
}
int main(void){

    cudaMalloc((void**)&pBuffer_device,sizeof(char)*pSize);
    cudaMemcpy(pBuffer_device,pBuffer,sizeof(char)*pSize,cudaMemcpyHostToDevice);
    kernel<<<BLOCK,THREAD>>>(pBuffer_device,pSize);

}
  • The code you have posted doesn't do anything and wouldn't run even if it did. This isn't your actual code, is it? Commented Mar 17, 2012 at 11:26
  • No, it is not my actual code. It is just a part which is related to using the shared memory. Commented Mar 17, 2012 at 11:50
  • So you would like to know why code you haven't shown, which uses shared memory, doesn't run as fast as other code you also haven't shown, which doesn't use shared memory? Do you think it is reasonable to expect an answer? Commented Mar 17, 2012 at 11:56
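
On the constant-memory part of the question, here is a minimal sketch rather than a drop-in solution. It assumes the text fits in a fixed-size `__constant__` array (constant memory totals 64 KB per device) and copies it with `cudaMemcpyToSymbol`. The names `cText`, `MAX_TEXT`, `letterHistogram` and `letterCounts` are hypothetical and not from the original code. Thread t counts the letter 'a' + t, so all threads of the warp read the same `cText[i]` at each step, which is the access pattern the constant cache handles best.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

#define MAX_TEXT 4096                 // hypothetical cap, well below the 64 KB total

__constant__ char cText[MAX_TEXT];    // the text lives in constant memory

// Thread t counts occurrences of the letter 'a' + t.
// Every thread reads the same cText[i] in each iteration (a broadcast read).
__global__ void letterHistogram(int n, int *counts)
{
    int t = threadIdx.x;
    if (t >= 26) return;
    int c = 0;
    for (int i = 0; i < n; ++i)
        if (cText[i] == 'a' + t) ++c;
    counts[t] = c;
}

// Host helper (hypothetical): fills out[0..25]; returns false if the text
// does not fit or a CUDA call fails.
bool letterCounts(const char *text, int n, int out[26])
{
    if (n > MAX_TEXT) return false;
    int *dCounts = nullptr;
    if (cudaMemcpyToSymbol(cText, text, n) != cudaSuccess) return false;
    if (cudaMalloc((void **)&dCounts, 26 * sizeof(int)) != cudaSuccess) return false;
    letterHistogram<<<1, 32>>>(n, dCounts);
    cudaError_t err = cudaMemcpy(out, dCounts, 26 * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dCounts);
    return err == cudaSuccess;
}

int main(void)
{
    const char *text = "constant memory";
    int counts[26];
    if (!letterCounts(text, (int)strlen(text), counts)) return 1;
    for (int t = 0; t < 26; ++t)
        if (counts[t]) printf("%c: %d\n", 'a' + t, counts[t]);
    return 0;
}
```

If each thread read a different element of `cText` instead, the reads within a warp would be serialized, and constant memory would likely be slower than global memory for that pattern.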

1 Answer

  1. Maybe because every thread in a block concurrently writes to the same shared memory addresses, ranging from 0 to pSize - 1!
    Use thread-collaborative loading of global memory data into shared memory: http://forums.nvidia.com/index.php?showtopic=216640&view=findpost&p=1332005
    Every thread in your kernel performs "pSize" global memory reads.
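
A minimal sketch of what collaborative loading could look like, assuming the whole buffer fits in the shared memory of one block. The names `countChar` and `countOnDevice`, and the idea of counting a target character, are hypothetical and only there so the kernel produces an output. Each thread copies a strided slice of the buffer, so the block as a whole performs pSize global reads instead of every thread performing pSize of them.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// The block loads pBuffer into shared memory cooperatively, then counts
// how often `target` occurs in it.
__global__ void countChar(const char *pBuffer, int pSize, char target, int *count)
{
    extern __shared__ char p[];               // pSize bytes, sized at launch

    // Thread t copies elements t, t + blockDim.x, t + 2*blockDim.x, ...
    for (int i = threadIdx.x; i < pSize; i += blockDim.x)
        p[i] = pBuffer[i];
    __syncthreads();                          // p[] is complete past this point

    int local = 0;
    for (int i = threadIdx.x; i < pSize; i += blockDim.x)
        if (p[i] == target) ++local;
    if (local) atomicAdd(count, local);
}

// Host helper (hypothetical): copies the text over, launches one block,
// and returns the count.
int countOnDevice(const char *text, int n, char target)
{
    char *dText = nullptr;
    int *dCount = nullptr;
    int hCount = 0;
    cudaMalloc((void **)&dText, n);
    cudaMalloc((void **)&dCount, sizeof(int));
    cudaMemcpy(dText, text, n, cudaMemcpyHostToDevice);
    cudaMemset(dCount, 0, sizeof(int));

    // Third launch parameter: bytes of dynamic shared memory per block.
    countChar<<<1, 128, n>>>(dText, n, target, dCount);

    cudaMemcpy(&hCount, dCount, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dText);
    cudaFree(dCount);
    return hCount;
}

int main(void)
{
    const char *text = "shared memory example";
    printf("%d\n", countOnDevice(text, (int)strlen(text), 'e'));
    return 0;
}
```

In this toy each thread only reads back what it loaded itself, so shared memory gains nothing here; the staging only pays off when several threads of the block reuse the same loaded data.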

4 Comments

I wouldn't read too much into that code. Firstly, because nothing in it contributes to an output, dead code removal will remove everything inside the kernel. Secondly, the kernel launch is missing a shared memory size argument. So the kernel is both empty, and would fail if it wasn't.
I didn't ;), and I just checked the kernel itself, not the call.
I think you should edit your point #2 -- if his pSize is too large for the shared memory on the device (or for the allocation, whichever is smaller), he will get a runtime error or launch error. The compiler/runtime never moves shared allocations to global memory automatically.
I did; that's why I assumed that is what happens.
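
To illustrate the two points raised in these comments, here is a rough sketch that passes the dynamic shared memory size as the third launch parameter, queries the per-block limit, and checks the launch for errors. The kernel `touchShared` and the helper `tryLaunch` are hypothetical names. The sketch assumes the default per-block limit reported by `cudaDevAttrMaxSharedMemoryPerBlock` and no opt-in to a larger carve-out; a request above that limit is expected to fail at launch rather than spill to global memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Fills n bytes of dynamic shared memory and writes their sum to *out,
// so the shared memory traffic cannot be optimized away.
__global__ void touchShared(int *out, int n)
{
    extern __shared__ char p[];
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        p[i] = 1;
    __syncthreads();
    if (threadIdx.x == 0) {
        int s = 0;
        for (int i = 0; i < n; ++i) s += p[i];
        *out = s;
    }
}

// Launches with `bytes` of dynamic shared memory and returns the first
// error reported by the launch or by the kernel's execution.
cudaError_t tryLaunch(size_t bytes)
{
    int *dOut = nullptr;
    cudaError_t err = cudaMalloc((void **)&dOut, sizeof(int));
    if (err != cudaSuccess) return err;

    touchShared<<<1, 64, bytes>>>(dOut, (int)bytes);
    err = cudaGetLastError();                 // launch-time errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();        // execution-time errors
    cudaFree(dOut);
    return err;
}

int main(void)
{
    int dev = 0, limit = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&limit, cudaDevAttrMaxSharedMemoryPerBlock, dev);
    printf("shared memory per block: %d bytes\n", limit);

    printf("1 KiB request: %s\n", cudaGetErrorString(tryLaunch(1024)));
    printf("limit + 1 KiB request: %s\n",
           cudaGetErrorString(tryLaunch((size_t)limit + 1024)));
    return 0;
}
```

Without the `cudaGetLastError` check, a failed launch goes unnoticed and the kernel simply never runs, which can make timings look misleading.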
