
I have a CUDA kernel that runs fine under the Nsight CUDA profiler or when run directly from the terminal. But if I use this command

cuda-memcheck --leak-check full ./CudaTT 1 ../../file.jpg 

It crashes with "unspecified launch failure". I'm using this check after each kernel launch:

e = cudaDeviceSynchronize();

if (e != cudaSuccess) printf("Fail in kernel 2: %s\n", cudaGetErrorString(e));
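As an aside, this kind of check is often wrapped in a helper so it can be applied consistently after every runtime call and kernel launch. A minimal sketch (the CHECK_CUDA macro name is my own, not from the original post; it needs an NVIDIA GPU and nvcc to run):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper macro: checks the result of any CUDA runtime call
// and reports file/line plus the error string on failure.
#define CHECK_CUDA(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                  \
                    __FILE__, __LINE__, cudaGetErrorString(err));         \
        }                                                                 \
    } while (0)

// Usage after a kernel launch:
//   myKernel<<<grid, block>>>(...);
//   CHECK_CUDA(cudaGetLastError());       // catches launch/configuration errors
//   CHECK_CUDA(cudaDeviceSynchronize());  // catches asynchronous execution errors
```

Checking cudaGetLastError() immediately after the launch distinguishes launch-time failures (bad configuration, too many registers) from errors that only surface later at the synchronize.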

and cuda-memcheck shows several errors like this:

========= Program hit error 4 on CUDA API call to cudaDeviceSynchronize 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x24e129]
=========     Host Frame:/usr/local/cuda-5.0/lib/libcudart.so.5.0 (cudaDeviceSynchronize + 0x214) [0x27e24]
=========
========= Program hit error 4 on CUDA API call to cudaFree 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x24e129]
=========     Host Frame:/usr/local/cuda-5.0/lib/libcudart.so.5.0 (cudaFree + 0x228) [0x338b8]

At the end it shows:

========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 10 errors

Any idea why this happens?

Edit:

I commented out another kernel that was failing to launch because it used too many registers, and now the error from the kernel above has changed: it says "the launch timed out and was terminated". Again, it runs fine under the CUDA profiler and from the terminal without cuda-memcheck, but with cuda-memcheck it shows this:

========= Program hit error 6 on CUDA API call to cudaDeviceSynchronize 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x24e129]
=========     Host Frame:/usr/local/cuda-5.0/lib/libcudart.so.5.0 (cudaDeviceSynchronize + 0x214) [0x27e24]
=========
========= Program hit error 6 on CUDA API call to cudaFree 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x24e129]
=========     Host Frame:/usr/local/cuda-5.0/lib/libcudart.so.5.0 (cudaFree + 0x228) [0x338b8]
=========     Host Frame:[0xbf913ea8]

and the same 10 errors at the end:

========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 10 errors

Error 6 appears to mean the kernel ran past a time limit, but how come it works without cuda-memcheck? The profiler shows the kernel lasts 3.771 seconds.

Another strange behavior: I print some values after the calculations, and the values differ depending on whether or not I run under cuda-memcheck.

2 Answers

A better link would be http://docs.nvidia.com/cuda/cuda-memcheck/index.html. cuda-memcheck can and does alter the run time of the application's CUDA kernels. If the GPU is being used for display, a watchdog timer prevents the run time of any kernel from exceeding a fixed limit (on Linux, this is usually ~5 seconds). Given that the uninstrumented kernel takes 3.7 seconds, it is very likely that the instrumented version of the kernel run under memcheck exceeds the watchdog limit, and hence the kernel launch is timed out. There are a couple of options in such cases:

  1. Run on a system where X has not been started
  2. Launch the X server in non-interactive mode using Option "Interactive" "off" in /etc/X11/xorg.conf. Note that in this mode, the display will not update while a CUDA kernel is running.
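The relevant xorg.conf fragment would look roughly like this (the Identifier value varies per system; this is a sketch, so check your existing Device section rather than copying it verbatim):

```
Section "Device"
    Identifier "Device0"
    Driver     "nvidia"
    Option     "Interactive" "off"
EndSection
```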

1 Comment

I have had more success with the --no-xorg flag to optirun. In my experience, running with this flag works, whereas the method described in the answer above does not.

It appears kernels run much slower under cuda-memcheck.

people.maths.ox.ac.uk/gilesm/cuda/doc/cuda-memcheck.pdf

Page 16

"Applications run much slower under CUDA‐MEMCHECK. This may cause some kernel launches to fail with a launch timeout error when running with CUDA‐ MEMCHECK enabled. "

5 Comments

As you've discovered, cuda-memcheck can do a variety of non-obvious things during execution of your code. In particular, it can modify threadblock execution order and also run fewer threadblocks at a time. Since a proper CUDA code should be independent of threadblock execution order, none of this "should" matter. If it does make a difference, it suggests you may have a dependency in your code that should not be there (a latent bug). A possible side effect is longer execution time, which could trigger the watchdog timeout that you might not ordinarily hit.
If I understand correctly, kernels launch asynchronously, right? But shouldn't the cudaDeviceSynchronize call force synchronization? In that case I don't understand how the printf call that comes after the kernel gets executed. Could it be that it prints the wrong values because the kernel was not able to finish processing everything and writing it to global memory?
I'm not sure I can answer this in the comments, and you've shown only about 2 lines of code. Yes, kernels launch asynchronously. But unless you're using streams, all CUDA operations in your code execute in order anyway. The device synchronize call will prevent any subsequent host code from executing until the previous CUDA activity has finished. But ultimately, all of your host code (i.e. the printf statements) is going to execute one way or the other, and if a kernel fails for some reason, the printfs are likely to print strange results. You've shown so little code.
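To illustrate the ordering described in the comment above, here is a minimal sketch (the scale kernel and buffer names are made up for illustration; it needs an NVIDIA GPU and nvcc to run):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel: doubles each element, guarding against out-of-range threads.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 256;
    float h_data[n];
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 127) / 128, 128>>>(d_data, n);   // launch returns immediately

    // The host blocks here until the kernel finishes. Host code after this
    // point still runs even if the kernel failed, which is why a failed
    // kernel can lead to printfs showing stale or garbage values.
    cudaError_t e = cudaDeviceSynchronize();
    if (e != cudaSuccess) printf("Kernel failed: %s\n", cudaGetErrorString(e));

    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h_data[0] = %f\n", h_data[0]);
    cudaFree(d_data);
    return 0;
}
```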
Yeah, my code is large, so I think posting it would be confusing, but I think that answered my question. Tomorrow I'm testing the code on a remote machine with no X, so the timeout shouldn't be an issue even with cuda-memcheck, I guess. I'll check whether the values are correct then. Thanks.
I ran the program on the remote cluster with memcheck and there was no timeout after all, so I guess the problem was running it on a machine whose CUDA device also drives the display.
