This is a simple test program that I have been working on (to help debug my work on a running sum function), and I just cannot seem to find what's wrong. The program simply calls my running sum function on a small list and attempts to print out the data. The line that's creating all the trouble is the one that's commented out: the cudaMemcpy call with cudaMemcpyDeviceToHost. When that line is part of the code, the error I get is:
CUDA error at: student_func.cu:136 unspecified launch failure
cudaGetLastError() terminate called after throwing an instance of
'thrust::system::system_error' what(): unload of CUDA runtime failed
I simply do not know what's wrong with this and it's driving me insane. I tried using regular old malloc, with the same result. I have confirmed that the input data gets copied over to the device array fine (by printing in the kernel), but I am simply not able to copy the results back from device to host. I would really appreciate any help whatsoever! Thanks in advance :)
unsigned int numElems = 100;
unsigned int blockLength = min( (unsigned int) 1024, (unsigned int) numElems);
unsigned int gridLength = ceil ( (float) numElems / (float) blockLength );
unsigned int* d_in;
unsigned int* h_in;
checkCudaErrors(cudaMallocHost(&h_in, sizeof(unsigned int) * numElems));
for (int i = 0; i < numElems; i++)
{
    h_in[i] = i;
}
checkCudaErrors(cudaMalloc(&d_in, sizeof(unsigned int) * numElems));
checkCudaErrors(cudaMemcpy(d_in, h_in, sizeof(unsigned int) * numElems, cudaMemcpyHostToDevice));
exclusive_running_sum<<< gridLength, blockLength >>>(d_in, d_in, numElems);
cudaDeviceSynchronize();
checkCudaErrors(cudaGetLastError());
//this line is a problem!!
//checkCudaErrors(cudaMemcpy(h_in, d_in, sizeof(unsigned int) * numElems, cudaMemcpyDeviceToHost));
for (int i = 0; i < numElems; i++)
{
    printf("%i %i\n", i, h_in[i]);
}
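For reference, the launch configuration computed at the top of the snippet (block length capped at 1024, grid length as a rounded-up quotient) can also be written with integer arithmetic only, which avoids the float casts and ceil. This is just a sketch in plain host-side C++ that should compile unchanged inside a .cu file; the helper names block_length and grid_length are hypothetical and do not come from my actual program.

```cpp
// Hypothetical helper (not part of the original program):
// threads per block, capped at maxThreads (1024 in the snippet above).
unsigned int block_length(unsigned int numElems, unsigned int maxThreads = 1024)
{
    return numElems < maxThreads ? numElems : maxThreads;
}

// Hypothetical helper (not part of the original program):
// number of blocks needed to cover numElems, i.e. ceil(numElems / blockLength)
// computed without floating point. Returns 0 when blockLength is 0 so that
// an empty input does not divide by zero. Assumes numElems + blockLength - 1
// does not overflow unsigned int.
unsigned int grid_length(unsigned int numElems, unsigned int blockLength)
{
    if (blockLength == 0)
    {
        return 0;
    }
    return (numElems + blockLength - 1) / blockLength;
}
```

With numElems = 100 this gives a block length of 100 and a grid length of 1, the same values the float-based version above produces.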
cudaErrorInvalidConfiguration error in the runtime API.