Problem:
I occasionally get the following CUDA error when running PyTorch scripts on an Nvidia GPU under CentOS 7.
If I run:
python3 -c 'import torch; print(torch.cuda.is_available()); torch.randn(1).to("cuda")'
I get the following output:
True
Traceback (most recent call last):
File "<string>", line 1, in <module>
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
PyTorch seems to think the GPU is available, but I can't put anything into its memory. When I restart the computer, the error goes away, and I can't reproduce it consistently.
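For reference, here is a slightly more structured version of the one-liner I've been using to probe the state, separating "PyTorch reports CUDA as available" from "an allocation actually succeeds" (the helper name cuda_probe is just mine, not a PyTorch API):

```python
def cuda_probe():
    """Return (ok, message): whether a tiny GPU allocation succeeds."""
    try:
        import torch  # guarded so the probe also runs on machines without PyTorch
    except ImportError:
        return False, "torch not installed"
    if not torch.cuda.is_available():
        return False, "torch.cuda.is_available() returned False"
    try:
        # The same operation that fails intermittently in the question.
        torch.randn(1).to("cuda")
        return True, "allocation succeeded"
    except RuntimeError as exc:
        # e.g. "CUDA error: all CUDA-capable devices are busy or unavailable"
        return False, str(exc)


if __name__ == "__main__":
    ok, msg = cuda_probe()
    print(ok, msg)
```

When the problem occurs, this prints False together with the "busy or unavailable" message even though is_available() was True, which matches the behavior above.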