
This is a simple test program that I have been working on (to help with debugging my work on a running-sum function), and I just cannot seem to find what's wrong. The program simply calls my running-sum function on a small list and attempts to print out the data. The line that's causing all the trouble is the one that's commented out: the cudaMemcpy(DeviceToHost). When that line is part of the code, the error I get is:

CUDA error at: student_func.cu:136 unspecified launch failure
cudaGetLastError() terminate called after throwing an instance of
'thrust::system::system_error' what(): unload of CUDA runtime failed

I simply do not know what's wrong with this, and it's driving me insane. I tried using regular old malloc with the same result. I have confirmed that the input data gets copied over to the device array fine (by printing in the kernel), but I simply am not able to copy the results back from device to host. I would really appreciate any help whatsoever! Thanks in advance :)

unsigned int numElems = 100;
unsigned int blockLength = min((unsigned int) 1024, numElems);
unsigned int gridLength = ceil((float) numElems / (float) blockLength);

unsigned int* d_in;

unsigned int* h_in;
checkCudaErrors(cudaMallocHost(&h_in, sizeof(unsigned int) * numElems));

for (int i = 0; i < numElems; i++)
{
   h_in[i] = i;
}

checkCudaErrors(cudaMalloc(&d_in, sizeof(unsigned int) * numElems));
checkCudaErrors(cudaMemcpy(d_in, h_in, sizeof(unsigned int) * numElems, cudaMemcpyHostToDevice));

exclusive_running_sum<<< gridLength, blockLength >>>(d_in, d_in, numElems);
cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());

//this line is a problem!!
//checkCudaErrors(cudaMemcpy(h_in, d_in, sizeof(unsigned int) * numElems, cudaMemcpyDeviceToHost));

for (int i = 0; i < numElems; i++)
{
    printf("%i %i\n", i, h_in[i]);
}
  • The cudaMemcpy call isn't the problem. Your kernel is failing somehow. Commented Jul 6, 2013 at 20:59
  • I've found an unspecified launch failure is sometimes bad block or grid sizes. Try changing them to, say, 1 and seeing if it runs. Also, are they ok as ints? It's been a long time since I've CUDAd. Commented Jul 7, 2013 at 6:40
  • 2
    @PO'Conbhui: No. illegal block or grid dimensions will produce a cudaErrorInvalidConfiguration error in the runtime API. Commented Jul 7, 2013 at 7:12
  • Thanks for the replies. I have confirmed that the kernel works fine when that line is commented out. I even tried printing out the results (from the kernel itself) and they are fine. It's definitely the copy that's causing this. I will try playing with the array size and get back. Commented Jul 7, 2013 at 7:40
  • 2
    @RajivNair: Could you edit your question to include the shortest, complete version of the code which someone else could compile and run themselves. The error you are seeing is really being generated by the kernel, it will be a combination of imperfect error checking and inexperience which is leading you to an incorrect diagnosis of the problem. But we can't help you without seeing a complete example which replicates the problem. Commented Jul 7, 2013 at 8:40
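  • As the last comment notes, kernel execution errors are reported asynchronously: a failed kernel is often first reported by the next runtime API call (here, the cudaMemcpy). A common checking pattern that separates launch errors from execution errors, reusing the question's checkCudaErrors macro, looks roughly like this (a sketch, not the asker's exact code):

    ```cuda
    exclusive_running_sum<<< gridLength, blockLength >>>(d_in, d_in, numElems);

    // Catches launch-time problems (e.g. invalid configuration) immediately.
    checkCudaErrors(cudaGetLastError());

    // Blocks until the kernel finishes and returns any error that occurred
    // during execution, such as an unspecified launch failure.
    checkCudaErrors(cudaDeviceSynchronize());
    ```

    Note that the question's code calls cudaDeviceSynchronize() without checking its return value, which is one way an execution error can slip through to the subsequent copy.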

1 Answer


Thanks to everyone for the help. I have found the bug. After much debugging, I realized that I (very foolishly) had forgotten that the kernel uses externally allocated (extern __shared__) shared memory, so the launch configuration also needs to supply the shared-memory size.
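For context, a kernel that declares its shared memory as `extern __shared__` must be given the buffer size in bytes as the third execution-configuration argument; if it is omitted, the dynamic shared-memory allocation has size zero, and every access to it is out of bounds, which can surface as an unspecified launch failure on the next API call. A minimal sketch of the fix (the kernel body here is illustrative, not the asker's actual `exclusive_running_sum` implementation):

```cuda
// Kernel using externally (dynamically) allocated shared memory.
// The size of s_data is not known at compile time; it is supplied
// by the third <<<>>> argument at launch.
__global__ void exclusive_running_sum(const unsigned int* d_in,
                                      unsigned int* d_out,
                                      unsigned int numElems)
{
    extern __shared__ unsigned int s_data[];

    unsigned int tid = threadIdx.x;
    unsigned int gid = blockIdx.x * blockDim.x + tid;
    if (gid < numElems)
        s_data[tid] = d_in[gid];
    __syncthreads();

    // ... scan over s_data, then write the result back to d_out ...
}

// The third launch argument reserves blockLength * sizeof(unsigned int)
// bytes of dynamic shared memory per block. Without it, s_data has
// zero bytes and the kernel corrupts memory.
exclusive_running_sum<<< gridLength, blockLength,
                         blockLength * sizeof(unsigned int) >>>(d_in, d_in, numElems);
```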
