Using Shared & Constant Memory in CUDA

Question

I want to read a text file and store it in an array. Then I want to transfer the array from the host to the device and store it in shared memory. I have written the following code, but the execution time has increased compared with using global memory, and I cannot work out why. It would also be great if someone could help me write this code using constant memory.

__global__ void deviceFunction(char *pBuffer,int pSize){
    extern __shared__ char p[];
    int i;
    for(i=0;i<pSize;i++){
        p[i] = pBuffer[i];
    }
}
int main(void){

    cudaMalloc((void**)&pBuffer_device,sizeof(char)*pSize);
    cudaMemcpy(pBuffer_device,pBuffer,sizeof(char)*pSize,cudaMemcpyHostToDevice);
    kernel<<<BLOCK,THREAD>>>(pBuffer_device,pSize);

}

The code you have posted doesn't do anything and wouldn't run even if it did. This isn't your actual code, is it?
– talonmies, Commented Mar 17, 2012 at 11:26
No, it is not my actual code. It is just the part related to using shared memory.
– user1192151, Commented Mar 17, 2012 at 11:50
So you would like to know why code you haven't shown, which uses shared memory, doesn't run as fast as other code you also haven't shown, which doesn't use shared memory? Do you think it is reasonable to expect an answer?
– talonmies, Commented Mar 17, 2012 at 11:56

djmj · Accepted Answer · 2012-03-19 09:27:15Z

Maybe because every thread in a block concurrently writes to the same shared memory addresses, from 0 to pSize - 1.
Use thread-collaborative loading of global memory data into shared memory: http://forums.nvidia.com/index.php?showtopic=216640&view=findpost&p=1332005
As written, every thread in your kernel performs "pSize" global memory reads.
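The collaborative loading suggested above might look roughly like the sketch below. It is not the asker's real code: the names countCharShared and countChar, the single block of 256 threads, and the character-counting step are all assumptions added so that the kernel produces an output. Each thread copies only every blockDim.x-th element, so the block as a whole reads the buffer from global memory once instead of once per thread. The sketch also assumes the buffer fits in one block's shared memory. Because each thread here reads back only the elements it loaded itself, the example shows the loading pattern rather than a speedup; shared memory pays off when threads reuse each other's data.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Stages pBuffer in shared memory cooperatively, then counts `target`.
__global__ void countCharShared(const char *pBuffer, int pSize, char target,
                                unsigned int *count)
{
    extern __shared__ char p[];  // size comes from the third launch argument

    // Cooperative load: thread t copies elements t, t + blockDim.x, ...
    for (int i = threadIdx.x; i < pSize; i += blockDim.x)
        p[i] = pBuffer[i];
    __syncthreads();  // wait until the whole buffer has been staged

    // Use the staged data so the loads are not removed as dead code.
    unsigned int local = 0;
    for (int i = threadIdx.x; i < pSize; i += blockDim.x)
        if (p[i] == target) ++local;
    if (local > 0) atomicAdd(count, local);
}

// Hypothetical host wrapper: returns the count, or -1 on any CUDA error.
long countChar(const char *text, int n, char target)
{
    assert(text != nullptr && n > 0);
    char *dText = nullptr;
    unsigned int *dCount = nullptr;
    unsigned int hCount = 0;
    long result = -1;

    if (cudaMalloc(&dText, n) == cudaSuccess &&
        cudaMalloc(&dCount, sizeof(unsigned int)) == cudaSuccess &&
        cudaMemcpy(dText, text, n, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemset(dCount, 0, sizeof(unsigned int)) == cudaSuccess) {
        // One block of 256 threads; n bytes of dynamic shared memory.
        countCharShared<<<1, 256, n>>>(dText, n, target, dCount);
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(&hCount, dCount, sizeof(unsigned int),
                       cudaMemcpyDeviceToHost) == cudaSuccess)
            result = hCount;
    }
    cudaFree(dText);
    cudaFree(dCount);
    return result;
}
```

Calling countChar(text, n, '\n') on the file contents would count its newlines. With more than one block, each block would need to stage only its own tile of the buffer.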

edited Mar 19, 2012 at 9:27

answered Mar 17, 2012 at 16:33


4 Comments

talonmies Over a year ago

I wouldn't read too much into that code. Firstly, because nothing in it contributes to an output, dead code removal will remove everything inside the kernel. Secondly, the kernel launch is missing a shared memory size argument. So the kernel is both empty, and would fail if it wasn't.

djmj Over a year ago

I didn't ;), and I just checked the kernel itself, not the call.

harrism Over a year ago

I think you should edit your point #2 -- if his pSize is too large for the shared memory on the device (or for the allocation, whichever is smaller), he will get a runtime error or launch error. The compiler/runtime never moves shared allocations to global memory automatically.
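The failure mode described in this comment can be checked from the host. The sketch below is an illustration under assumptions: touchShared, launchWithSharedBytes, and sharedLimitPerBlock are hypothetical names, and device 0 is assumed to be the target. It reads the per-block limit from cudaDeviceProp::sharedMemPerBlock and reports whether a launch requesting a given number of dynamic shared bytes is accepted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Fills n bytes of dynamic shared memory and writes one of them out,
// so the kernel has an observable result.
__global__ void touchShared(int *out, int n)
{
    extern __shared__ char p[];
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        p[i] = (char)(i & 0x7f);
    __syncthreads();
    if (threadIdx.x == 0 && n > 0) *out = p[n - 1];
}

// Hypothetical helper: default shared memory limit per block on device 0,
// or 0 if the query fails.
size_t sharedLimitPerBlock()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 0;
    return prop.sharedMemPerBlock;
}

// Hypothetical helper: true if a launch with `bytes` of dynamic shared
// memory is accepted and runs to completion.
bool launchWithSharedBytes(size_t bytes)
{
    int *dOut = nullptr;
    if (cudaMalloc(&dOut, sizeof(int)) != cudaSuccess) return false;

    touchShared<<<1, 128, bytes>>>(dOut, (int)bytes);
    cudaError_t err = cudaGetLastError();        // launch configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();           // errors during execution
    if (err != cudaSuccess)
        fprintf(stderr, "launch with %zu bytes failed: %s\n",
                bytes, cudaGetErrorString(err));

    cudaFree(dOut);
    return err == cudaSuccess;
}
```

A request above the limit is rejected at launch time rather than silently spilled to global memory, so checking cudaGetLastError() right after the launch is what surfaces the problem.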

djmj Over a year ago

I did, that's why I assumed what happens.
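The question also asks about constant memory, which the thread does not address. A minimal sketch follows, under stated assumptions: cBuffer, MAX_CONST_BYTES, countCharConst, and countCharInConstant are hypothetical names, the 4096-byte cap is arbitrary (constant memory totals 64 KB), and the counting step exists only to give the kernel an output. A __constant__ array has a fixed compile-time size and is filled from the host with cudaMemcpyToSymbol rather than passed as a kernel argument.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MAX_CONST_BYTES 4096  // assumed cap; must be known at compile time

__constant__ char cBuffer[MAX_CONST_BYTES];

// Counts occurrences of `target` in the first pSize bytes of cBuffer,
// using a grid-stride loop.
__global__ void countCharConst(int pSize, char target, unsigned int *count)
{
    unsigned int local = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < pSize;
         i += gridDim.x * blockDim.x)
        if (cBuffer[i] == target) ++local;
    if (local > 0) atomicAdd(count, local);
}

// Hypothetical host wrapper: returns the count, or -1 if the text does not
// fit in cBuffer or a CUDA call fails.
long countCharInConstant(const char *text, int n, char target)
{
    if (text == nullptr || n <= 0 || n > MAX_CONST_BYTES) return -1;

    unsigned int *dCount = nullptr;
    unsigned int hCount = 0;
    long result = -1;

    // Copy host data into the __constant__ symbol (host to device by default).
    if (cudaMemcpyToSymbol(cBuffer, text, n) == cudaSuccess &&
        cudaMalloc(&dCount, sizeof(unsigned int)) == cudaSuccess &&
        cudaMemset(dCount, 0, sizeof(unsigned int)) == cudaSuccess) {
        countCharConst<<<4, 128>>>(n, target, dCount);
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(&hCount, dCount, sizeof(unsigned int),
                       cudaMemcpyDeviceToHost) == cudaSuccess)
            result = hCount;
    }
    cudaFree(dCount);
    return result;
}
```

Constant memory is served through a cache that is fastest when all threads of a warp read the same address. In this sketch neighbouring threads read different addresses, so the accesses are serialized and the example shows the mechanics only; a small lookup table or pattern that every thread reads in step would be a better fit than a whole text file.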
