
Problem:

I occasionally get the following CUDA error when running PyTorch scripts on an Nvidia GPU under CentOS 7.

If I run:

python3 -c 'import torch; print(torch.cuda.is_available()); torch.randn(1).to("cuda")'

I get the following output:

True
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

PyTorch seems to think the GPU is available, but I can't put anything into its memory. When I restart the computer, the error goes away, and I can't reproduce it consistently.
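For reference, torch.cuda.is_available() only checks that a driver and a CUDA-capable device are visible; it does not attempt an allocation. A minimal sketch of a stricter check (the helper name is mine, not part of PyTorch) that tries a tiny allocation and reports failure instead of crashing:

import torch

def cuda_actually_usable() -> bool:
    """Hypothetical helper: True only if a small CUDA allocation succeeds."""
    if not torch.cuda.is_available():
        return False
    try:
        # A tiny allocation forces CUDA context creation and surfaces errors
        # like "all CUDA-capable devices are busy or unavailable" right away.
        torch.empty(1, device="cuda")
        return True
    except RuntimeError:
        return False

print(cuda_actually_usable())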

1 Answer


When I exit Python and run nvidia-smi, it shows a process still running on the GPU, even though I cancelled the PyTorch script:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:00:06.0 Off |                    0 |
| N/A   29C    P0    33W / 250W |   1215MiB / 32510MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     18805      C   python3                          1211MiB |
+-----------------------------------------------------------------------------+

If I kill that process (PID 18805) with kill -9 18805, it no longer appears in nvidia-smi, and the error does not recur.
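A minimal sketch of scripting that same cleanup, assuming nvidia-smi is on the PATH and you own the listed processes (this is my own wrapper around the nvidia-smi/kill steps above, not an official tool):

import os
import signal
import subprocess

# Ask nvidia-smi for the PIDs of compute processes still holding GPU memory.
out = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    pid = int(line.strip())
    if pid != os.getpid():            # never kill the current process
        os.kill(pid, signal.SIGKILL)  # equivalent to `kill -9 <pid>`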

Any insights on a better solution, or how to avoid this problem in the first place, are very welcome.
