
I installed the NVIDIA Windows driver and CUDA according to this article. After installing the driver, I checked the CUDA version by running “/usr/lib/wsl/lib/nvidia-smi”:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.00       Driver Version: 510.06       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+

Then I installed CUDA Toolkit 11.3 according to this article. After that, I checked the CUDA Toolkit version by running “/usr/local/cuda/bin/nvcc --version” and got:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

Then I installed PyTorch through pip:

pip3 install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

Then I verified the torch installation like this:

import torch
x = torch.rand(5, 3)
print(x)

and this:

import torch
torch.cuda.is_available()
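
Note that neither check above runs anything on the GPU: the first tensor is created on the CPU, and torch.cuda.is_available() only reports whether a device can be used. A minimal sketch of a stricter check, assuming a single GPU at index 0 (the helper name cuda_smoke_test is hypothetical, not part of torch):

```python
import importlib.util


def cuda_smoke_test():
    """Return a short status string describing whether a real CUDA op succeeds."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch

    if not torch.cuda.is_available():
        return "torch imported, but CUDA is not available"
    # Allocate on the GPU and launch an actual kernel (a small matrix product).
    x = torch.rand(5, 3, device="cuda")
    y = (x @ x.T).sum()
    # Wait for the GPU so that an asynchronous CUDA error surfaces here.
    torch.cuda.synchronize()
    name = torch.cuda.get_device_name(0)
    return f"CUDA OK on {name}, torch built for CUDA {torch.version.cuda}, sum={y.item():.4f}"


print(cuda_smoke_test())
```

If this passes but training still fails, the driver and toolkit installation is probably not the cause.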

Up to this point, everything works. However, when I train a network and call the backward() method of the loss, torch throws a runtime error like this:

Traceback (most recent call last):
File "train.py", line 118, in train_loop
  loss.backward()
File "/myvenv/lib/python3.6/site-packages/torch/_tensor.py", line 307, in backward
  torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/myvenv/lib/python3.6/site-packages/torch/autograd/__init__.py", line 156, in backward
  allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I've tried reinstalling the CUDA Toolkit many times but always get the same error. Any suggestions?
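
The hint at the end of the traceback can be followed by setting CUDA_LAUNCH_BLOCKING=1 before any CUDA work starts, so that kernels run synchronously and the stack trace points at the call that actually failed. A minimal sketch, assuming the variable is set at the very top of the training script, before torch is imported (the helper name is hypothetical):

```python
import os


def enable_blocking_launches():
    """Make CUDA kernel launches synchronous for easier debugging."""
    # Must be set before the first CUDA call; the simplest way is
    # to set it before importing torch at all.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return os.environ["CUDA_LAUNCH_BLOCKING"]


enable_blocking_launches()
# import torch  # import torch only after the variable is set, then run training
```

The same effect can be had from the shell with `CUDA_LAUNCH_BLOCKING=1 python train.py`. Training runs slower in this mode, so it is meant for debugging only.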

  • Is this the full trace? Commented Nov 19, 2021 at 17:39
  • @YakovDan Yes, this is the full trace. What puzzles me most is that the same code works well on the CPU but always fails on the GPU. Commented Jan 23, 2022 at 1:24

1 Answer


Through some simple experiments, I found the solution. The cause is that my GPU memory is too small (2 GB) to handle a relatively large batch of text (batch size 32). When I decreased the batch size to 16, the training script ran fine.

However, I still don't know why CUDA doesn't throw an exception with a clearer message for this kind of OOM error.
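
The manual experiment of lowering the batch size can be sketched as a small search loop. This is only an illustration: find_working_batch_size and fake_step are hypothetical names, and fake_step stands in for one real forward/backward pass. It relies on PyTorch reporting CUDA failures as RuntimeError, as in the traceback above. After a real "unknown error" the CUDA context may be left unusable, so in practice each attempt may need a fresh process.

```python
def find_working_batch_size(run_step, start=32, minimum=1):
    """Halve the batch size until run_step(batch_size) stops raising RuntimeError."""
    batch_size = start
    while batch_size >= minimum:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            print(f"batch_size={batch_size} failed: {err}")
            batch_size //= 2
    raise RuntimeError(f"no batch size >= {minimum} succeeded")


# Stand-in for a real training step: pretend anything above 16 exhausts GPU memory.
def fake_step(batch_size):
    if batch_size > 16:
        raise RuntimeError("CUDA error: unknown error")


print(find_working_batch_size(fake_step))
```

With the stand-in step, the attempt at 32 fails and the search settles on 16, which mirrors the result described above.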


2 Comments

This problem comes up for me any time I set my batch size to > 1. This doesn't make sense given that I have an RTX 3070 (8 GB).
It works for me when I decrease my batch size from 64 to 4 (my GPU memory is 4 GB).
