
Problem:

I occasionally get the following CUDA error when running PyTorch scripts on an Nvidia GPU under CentOS 7.

If I run:

python3 -c 'import torch; print(torch.cuda.is_available()); torch.randn(1).to("cuda")'

I get the following output:

True
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

PyTorch seems to think the GPU is available, but I can't put anything into its memory. When I restart the computer, the error goes away, and I can't reproduce it consistently.
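For reference, torch.cuda.is_available() only checks that a driver and a CUDA-capable device are visible; it does not attempt an allocation. A minimal sketch of a stricter check (the helper name is mine, not part of PyTorch) that tries a tiny allocation and reports failure instead of crashing:

import torch

def cuda_actually_usable() -> bool:
    """Hypothetical helper: True only if a small CUDA allocation succeeds."""
    if not torch.cuda.is_available():
        return False
    try:
        # A tiny allocation forces CUDA context creation and surfaces errors
        # like "all CUDA-capable devices are busy or unavailable" right away.
        torch.empty(1, device="cuda")
        return True
    except RuntimeError:
        return False

print(cuda_actually_usable())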

1 Answer


When I exit Python and run nvidia-smi, it shows a process still running on the GPU, even though I cancelled the PyTorch script:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:00:06.0 Off |                    0 |
| N/A   29C    P0    33W / 250W |   1215MiB / 32510MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     18805      C   python3                          1211MiB |
+-----------------------------------------------------------------------------+

If I kill that process (PID 18805) with kill -9 18805, it no longer appears in nvidia-smi, and the error does not recur.
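A minimal sketch of scripting that same cleanup, assuming nvidia-smi is on the PATH and you own the listed processes (this is my own wrapper around the nvidia-smi/kill steps above, not an official tool):

import os
import signal
import subprocess

# Ask nvidia-smi for the PIDs of compute processes still holding GPU memory.
out = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    pid = int(line.strip())
    if pid != os.getpid():            # never kill the current process
        os.kill(pid, signal.SIGKILL)  # equivalent to `kill -9 <pid>`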

Any insights on a better solution, or how to avoid this problem in the first place, are very welcome.
