I have a CUDA kernel that runs fine under the Nsight CUDA profiler and when I run it directly from the terminal. But if I run it with this command:
cuda-memcheck --leak-check full ./CudaTT 1 ../../file.jpg
it crashes with "unspecified launch failure". I check for errors like this after each kernel launch:
e=cudaDeviceSynchronize();
if (e != cudaSuccess) printf("Fail in kernel 2 %s",cudaGetErrorString(e));
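A more defensive variant of that check, as a minimal sketch: it also calls cudaGetLastError() right after the launch, so a kernel that never started (for example, because it needs too many registers) can be told apart from one that failed while running. The macro name CHECK_CUDA and the kernel dummyKernel are hypothetical placeholders, not names from the actual program.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print the failing call with file and line, then exit.
#define CHECK_CUDA(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__,        \
                    __LINE__, #call, cudaGetErrorString(err_));        \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Placeholder kernel standing in for the real one.
__global__ void dummyKernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;  // bounds check avoids out-of-range writes
}

int main()
{
    const int n = 1024;
    int *d_out = NULL;
    CHECK_CUDA(cudaMalloc(&d_out, n * sizeof(int)));

    dummyKernel<<<(n + 255) / 256, 256>>>(d_out, n);
    CHECK_CUDA(cudaGetLastError());       // errors detected at launch time
    CHECK_CUDA(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CHECK_CUDA(cudaFree(d_out));
    return 0;
}
```

Stopping at the first failure matters here: once a kernel has faulted, later calls such as cudaFree tend to return the same error, which would be consistent with the repeated entries in the cuda-memcheck output.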
cuda-memcheck then reports several errors like these:
========= Program hit error 4 on CUDA API call to cudaDeviceSynchronize
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/libcuda.so [0x24e129]
========= Host Frame:/usr/local/cuda-5.0/lib/libcudart.so.5.0 (cudaDeviceSynchronize + 0x214) [0x27e24]
=========
========= Program hit error 4 on CUDA API call to cudaFree
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/libcuda.so [0x24e129]
========= Host Frame:/usr/local/cuda-5.0/lib/libcudart.so.5.0 (cudaFree + 0x228) [0x338b8]
At the end it shows:
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 10 errors
Any idea why this happens?
Edit:
I commented out another kernel that was failing to launch because it used too many registers, and now the error on the kernel above has changed to "the launch timed out and was terminated". Again, it runs fine under the CUDA profiler and from the terminal without cuda-memcheck, but with cuda-memcheck it shows this:
========= Program hit error 6 on CUDA API call to cudaDeviceSynchronize
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/libcuda.so [0x24e129]
========= Host Frame:/usr/local/cuda-5.0/lib/libcudart.so.5.0 (cudaDeviceSynchronize + 0x214) [0x27e24]
=========
========= Program hit error 6 on CUDA API call to cudaFree
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/libcuda.so [0x24e129]
========= Host Frame:/usr/local/cuda-5.0/lib/libcudart.so.5.0 (cudaFree + 0x228) [0x338b8]
========= Host Frame:[0xbf913ea8]
followed by the same 10 errors at the end:
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 10 errors
Error 6 appears to be a timeout caused by a kernel running too long, but then why does it work without cuda-memcheck? The profiler shows the kernel takes 3.771 seconds.
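One thing that can be checked is whether the device enforces a run time limit on kernels, which is typically the case when the GPU also drives a display. cuda-memcheck instruments the kernel and makes it slower, so a kernel that stays under the limit on its own might exceed it under the tool. That is only a guess at the cause, not a confirmed diagnosis. The sketch below queries the attribute for the current device:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int device = 0;
    cudaError_t e = cudaGetDevice(&device);
    if (e != cudaSuccess) {
        fprintf(stderr, "cudaGetDevice: %s\n", cudaGetErrorString(e));
        return 1;
    }

    // Nonzero means the driver may terminate kernels that run too long.
    int timeoutEnabled = 0;
    e = cudaDeviceGetAttribute(&timeoutEnabled,
                               cudaDevAttrKernelExecTimeout, device);
    if (e != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetAttribute: %s\n", cudaGetErrorString(e));
        return 1;
    }

    printf("Device %d: kernel run time limit %s\n",
           device, timeoutEnabled ? "enabled" : "disabled");
    return 0;
}
```

If the limit is enabled, running the same cuda-memcheck command on a GPU that is not attached to a display would be one way to test this guess.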
Another strange thing: I print some values after the calculations, and they differ depending on whether or not I run under cuda-memcheck.