CUDA Memcpy Device to Host: unspecified launch failure

Question

This is a simple test program that I have been working on to help debug a running sum function, and I cannot find what's wrong. The program calls my running sum function on a small list and attempts to print the data. The line causing all the trouble is the one that is commented out: the cudaMemcpy with cudaMemcpyDeviceToHost. When that line is part of the code, the error I get is:

CUDA error at: student_func.cu:136 unspecified launch failure
cudaGetLastError() terminate called after throwing an instance of
'thrust::system::system_error' what(): unload of CUDA runtime failed

I simply do not know what's wrong with this, and it's driving me insane. I tried using regular malloc, with the same result. I have confirmed that the input data is copied to the device array correctly (by printing in the kernel), but I am not able to copy the results back from device to host. I would really appreciate any help whatsoever! Thanks in advance :)

unsigned int numElems = 100;
unsigned int blockLength = min( (unsigned int) 1024, (unsigned int) numElems);
unsigned int gridLength = ceil ( (float) numElems / (float) blockLength );

unsigned int* d_in;

unsigned int* h_in;
checkCudaErrors(cudaMallocHost(&h_in, sizeof(unsigned int) * numElems));

for (unsigned int i = 0; i < numElems; i++)
{
    h_in[i] = i;
}

checkCudaErrors(cudaMalloc(&d_in, sizeof(unsigned int) * numElems));
checkCudaErrors(cudaMemcpy(d_in, h_in, sizeof(unsigned int) * numElems, cudaMemcpyHostToDevice));

exclusive_running_sum<<< gridLength, blockLength >>>(d_in, d_in, numElems);
cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());

//this line is a problem!!
//checkCudaErrors(cudaMemcpy(h_in, d_in, sizeof(unsigned int) * numElems, cudaMemcpyDeviceToHost));

for (unsigned int i = 0; i < numElems; i++)
{
    printf("%u %u\n", i, h_in[i]);
}

The cudaMemcpy call isn't the problem. Your kernel is failing somehow.
– talonmies, Commented Jul 6, 2013 at 20:59
In my experience, an unspecified launch failure is sometimes caused by bad block or grid sizes. Try changing them to, say, 1 and see if it runs. Also, are they OK as ints? It's been a long time since I've used CUDA.
– P O'Conbhui, Commented Jul 7, 2013 at 6:40
@PO'Conbhui: No. Illegal block or grid dimensions will produce a cudaErrorInvalidConfiguration error in the runtime API.
– talonmies, Commented Jul 7, 2013 at 7:12
Thanks for the replies. I have confirmed that the kernel works fine when that line is commented out. I even tried printing the results from within the kernel, and they are fine. It's definitely the copy that's causing this. I will try playing with the array size and get back.
– Rajiv Nair, Commented Jul 7, 2013 at 7:40
@RajivNair: Could you edit your question to include the shortest complete version of the code that someone else could compile and run themselves? The error you are seeing is really being generated by the kernel; a combination of imperfect error checking and inexperience is leading you to an incorrect diagnosis of the problem. But we can't help you without seeing a complete example that replicates the problem.
– talonmies, Commented Jul 7, 2013 at 8:40
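
To make the point about error checking concrete, here is a minimal sketch of checking a kernel launch in two stages: once right after the launch, and once on the return value of cudaDeviceSynchronize. The names CHECK_CUDA, add_one, grid_for and launch_checked are hypothetical and not from the question; CHECK_CUDA only stands in for the question's checkCudaErrors helper, whose source is not shown.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (assumption): stands in for the question's
// checkCudaErrors, whose definition is not shown.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,    \
                    __LINE__, cudaGetErrorString(err_));              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel (assumption): only here so there is something to launch.
__global__ void add_one(unsigned int* data, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1;
    }
}

// Number of blocks needed to cover n elements, using integer arithmetic.
unsigned int grid_for(unsigned int n, unsigned int block)
{
    assert(block > 0);
    return (n + block - 1) / block;
}

// Launches the kernel and checks the two failure stages separately.
void launch_checked(unsigned int* d_data, unsigned int n)
{
    const unsigned int block = 256;
    add_one<<<grid_for(n, block), block>>>(d_data, n);
    CHECK_CUDA(cudaGetLastError());       // errors detected at launch time
    CHECK_CUDA(cudaDeviceSynchronize());  // errors raised while the kernel ran
}
```

Checked this way, a kernel that faults during execution is reported at the synchronize call, next to the launch, rather than at a later runtime call such as a cudaMemcpy.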

Rajiv Nair · Accepted Answer · 2013-07-07 15:59:54Z

Thanks to everyone for the help. I have found the bug. After much debugging, I realized that I had (very, very foolishly) forgotten that the kernel uses externally allocated shared memory (an extern __shared__ array).
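
For readers who hit the same error, the sketch below shows how an extern __shared__ array gets its size from the third kernel launch parameter. When that parameter is omitted, as in the two-argument launch in the question, the dynamic shared memory size defaults to zero bytes, so any access to the array is out of bounds. The asker's exclusive_running_sum kernel is not shown, so scan_block, shared_bytes and launch_scan are hypothetical names, and the assumption that the kernel needs one unsigned int of shared memory per thread is ours. The scan itself is a naive single-block version, chosen only to keep the example short.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical single-block exclusive scan (assumption: NOT the asker's kernel).
// Requires n <= blockDim.x.
__global__ void scan_block(const unsigned int* in, unsigned int* out, unsigned int n)
{
    // Size is not known at compile time; it comes from the launch configuration.
    extern __shared__ unsigned int temp[];

    unsigned int tid = threadIdx.x;
    temp[tid] = (tid < n) ? in[tid] : 0;
    __syncthreads();

    // Naive inclusive scan: read, synchronize, then write, to avoid races.
    for (unsigned int offset = 1; offset < blockDim.x; offset <<= 1) {
        unsigned int v = (tid >= offset) ? temp[tid - offset] : 0;
        __syncthreads();
        temp[tid] += v;
        __syncthreads();
    }

    // Shift the inclusive result by one position to make it exclusive.
    if (tid < n) {
        out[tid] = (tid == 0) ? 0 : temp[tid - 1];
    }
}

// Bytes of dynamic shared memory for one unsigned int per thread.
size_t shared_bytes(unsigned int block)
{
    return static_cast<size_t>(block) * sizeof(unsigned int);
}

// Launches the scan with the shared memory size as the third launch parameter.
cudaError_t launch_scan(const unsigned int* d_in, unsigned int* d_out,
                        unsigned int n, unsigned int block)
{
    assert(block > 0 && n <= block);
    scan_block<<<1, block, shared_bytes(block)>>>(d_in, d_out, n);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        return err;
    }
    return cudaDeviceSynchronize();
}
```

With the question's variables, the equivalent change would be a third launch argument such as blockLength * sizeof(unsigned int), assuming that is the amount of shared memory the kernel actually indexes.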
