PyTorch throws CUDA runtime error on WSL2

Question

I installed the NVIDIA Windows driver and CUDA according to this article. After installing the driver, I checked the CUDA version by running “/usr/lib/wsl/lib/nvidia-smi”:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.00       Driver Version: 510.06       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+

Then I installed CUDA Toolkit 11.3 according to this article. After that, I checked the CUDA Toolkit version by running “/usr/local/cuda/bin/nvcc --version” and got:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

Then I installed PyTorch through pip:

pip3 install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

Then I verified the installation of torch like this:

import torch
x = torch.rand(5, 3)
print(x)

and this:

import torch
torch.cuda.is_available()

Up to this point, everything worked. However, when I train a network and call the backward() method of the loss, torch throws a runtime error like this:

Traceback (most recent call last):
  File "train.py", line 118, in train_loop
    loss.backward()
  File "/myvenv/lib/python3.6/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/myvenv/lib/python3.6/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I've reinstalled the CUDA Toolkit many times but always get the same error. Any suggestions?

@YakovDan Yes, this is the full trace. What puzzles me most is that the same code works well on CPU but always fails on GPU.
– Yan, Commented Jan 23, 2022 at 1:24

Yan · Accepted Answer · 2022-04-21 08:36:05Z

2

After some simple experiments, I found the solution. The cause was that my GPU memory is too small (2 GB) to run a relatively large text batch (batch size 32). When I decreased the batch size to 16, the training script ran fine.
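The "lower the batch size until it fits" experiment can be automated. What follows is only a minimal sketch, not what the training script above does: `run_with_batch_backoff` and `fake_step` are hypothetical names, and `fake_step` is a stand-in for a real function that runs one forward and backward pass at the given batch size. The helper catches `RuntimeError` (the exception type shown in the traceback) and halves the batch size before retrying.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def run_with_batch_backoff(
    step: Callable[[int], T],
    batch_size: int,
    min_batch_size: int = 1,
) -> Tuple[T, int]:
    """Call step(batch_size), halving batch_size after each RuntimeError.

    Returns (result, batch_size_that_worked). If even min_batch_size fails,
    the last error is re-raised, since memory is then unlikely to be the cause.
    """
    while True:
        try:
            return step(batch_size), batch_size
        except RuntimeError:
            if batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
            print(f"step failed, retrying with batch_size={batch_size}")


# Hypothetical stand-in for a real training step: it pretends that any
# batch larger than 16 does not fit on the GPU.
def fake_step(batch_size: int) -> str:
    if batch_size > 16:
        raise RuntimeError("CUDA error: unknown error")
    return "ok"


result, used = run_with_batch_backoff(fake_step, batch_size=32)
print(result, used)
```

One caveat: after a CUDA "unknown error" the CUDA context in the current process may no longer be usable, so in practice it may be necessary to restart the script with the smaller batch size rather than retry in the same process.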

However, I still don't know why CUDA doesn't raise an exception with a clearer message for this kind of OOM error.
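The traceback itself suggests one thing to try: CUDA_LAUNCH_BLOCKING=1. This makes kernel launches synchronous, so the error is reported at the call that actually failed. It does not guarantee a more descriptive message. A minimal sketch, assuming the variable is set at the very top of the training script before torch is imported:

```python
import os

# Set this before any CUDA work starts; doing it before `import torch`
# is the safe choice. Shell equivalent: CUDA_LAUNCH_BLOCKING=1 python train.py
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])

# import torch  # import torch only after the variable is set, then train as usual
```

Synchronous launches slow training down, so this is better suited to a debugging run than to normal use.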

2 Comments

Austin Ulfers Over a year ago

This problem comes up for me any time I set my batch size to > 1. This doesn't make sense given that I've got an RTX 3070 (8 GB).

Lym Zoy Over a year ago

It works for me when I decrease my batch size from 64 to 4 (my GPU has 4 GB of memory).