
I'm using PyCUDA to run a kernel that is expected to take at least two hours to complete, but it fails after about one hour with this error:

pycuda._driver.Error: cuCtxSynchronize failed: unknown error

I'm using Windows, and I added the registry key TdrDelay and set it to 120000000 to ensure that Windows is not timing out my kernel.

This error doesn't happen when I adjust the parameters of the kernel so it is expected to complete in about 30 minutes. Why could the synchronize call be failing after the kernel has run for a long time?

Could my graphics card be overheating and preemptively terminating the kernel? Could there be a CUDA setting that terminates a kernel if it runs for too long? Could running the kernel in the NVIDIA Visual Profiler help figure out what the problem might be?

  • My guess would be that you are still hitting a TDR timeout. I'm not sure that your setting does what you think it does. Yes, your graphics card could be overheating, but that is unlikely (the GPU should have a mechanism to manage temperature, regardless of load). You can monitor temperatures with nvidia-smi. There are no CUDA settings that terminate a long-running kernel (other than the aforementioned Windows WDDM TDR). I doubt the visual profiler will shed any useful light on this. Commented May 16, 2018 at 14:37
  • The TdrDelay definitely does something, because before I added that key my kernel was timing out after two seconds. Maybe TdrDelay has some maximum value. Commented May 17, 2018 at 2:02
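To rule overheating in or out, the temperature can be logged from a second process while the kernel runs. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_temperatures` and `read_gpu_temperatures` are made up for illustration and are not part of PyCUDA or any NVIDIA library.

```python
import shutil
import subprocess


def parse_temperatures(csv_text):
    """Parse the output of
    `nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits`,
    which prints one value per GPU, one per line. Non-numeric lines are skipped."""
    return [int(line.strip()) for line in csv_text.splitlines()
            if line.strip().isdigit()]


def read_gpu_temperatures():
    """Return the current temperature (degrees C) of each GPU,
    or an empty list if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=temperature.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return []
    return parse_temperatures(result.stdout)


if __name__ == "__main__":
    print(read_gpu_temperatures())
```

Calling `read_gpu_temperatures()` in a loop with a `time.sleep` between samples, and writing the values to a file, would show whether the temperature climbs before the failure.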

1 Answer


I was able to get my long running kernel to complete without error by adding the registry key "TdrLevel" alongside "TdrDelay" and setting its value to 0.


2 Comments

Can you specify where and what exactly you set? Did you set both level and delay to 0?
@NoBlockhit HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers is where those two keys should be added. This link has info on what those keys do: answers.microsoft.com/en-us/windows/forum/all/…
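For reference, the registry change from the answer can be written out as `reg add` commands. The sketch below only builds and prints them; the function name `tdr_reg_commands` is hypothetical, and it assumes both values are of type REG_DWORD, as Microsoft's TDR documentation describes (TdrLevel 0 turns timeout detection off, and TdrDelay is measured in seconds with a default of 2). The commands would need to be run from an elevated prompt, and the change typically takes effect only after a restart.

```python
# Registry path from the comment above (HKLM = HKEY_LOCAL_MACHINE).
GRAPHICS_DRIVERS_KEY = r"HKLM\System\CurrentControlSet\Control\GraphicsDrivers"


def tdr_reg_commands(tdr_level=0, tdr_delay_seconds=None):
    """Build `reg add` argument lists for the TDR values.

    tdr_level=0 disables timeout detection, as in the answer.
    tdr_delay_seconds is optional and left untouched when None.
    """
    values = {"TdrLevel": tdr_level}
    if tdr_delay_seconds is not None:
        values["TdrDelay"] = tdr_delay_seconds

    commands = []
    for name, data in values.items():
        # A REG_DWORD holds an unsigned 32-bit integer.
        if not 0 <= data <= 0xFFFFFFFF:
            raise ValueError(f"{name}={data} does not fit in a REG_DWORD")
        commands.append(["reg", "add", GRAPHICS_DRIVERS_KEY,
                         "/v", name, "/t", "REG_DWORD",
                         "/d", str(data), "/f"])
    return commands


if __name__ == "__main__":
    # Print the commands rather than running them, so nothing is changed
    # on the machine by accident.
    for cmd in tdr_reg_commands(tdr_level=0):
        print(" ".join(cmd))
```

Disabling TDR means Windows will no longer recover from a hung GPU on its own, so it may be worth restoring the default once the long run is done.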
