CUDA runtime error (59) : device-side assert triggered

Question

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu line=265 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "main.py", line 109, in <module>
    train(loader_train, model, criterion, optimizer)
  File "main.py", line 54, in train
    optimizer.step()
  File "/usr/local/anaconda35/lib/python3.6/site-packages/torch/optim/sgd.py", line 93, in step
    d_p.add_(weight_decay, p.data)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:265

How do I resolve this error?

Try running your script with CUDA_LAUNCH_BLOCKING=1 python your_script.py to get a more accurate stack trace.
– McLawrence, Commented Aug 5, 2018 at 7:16
After running with CUDA_LAUNC...=1, I get the error /opt/conda/.../THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [0,0,0] Assertion t >= 0 && t < n_classes failed. This appears around 20 times, then the traceback follows: RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu:116. How can I resolve this?
– saichand, Commented Aug 5, 2018 at 8:00
This is an error with your target labels: t >= 0 && t < n_classes. Print your labels and make sure that they are non-negative and smaller than the number of outputs of your last layer.
– McLawrence, Commented Aug 5, 2018 at 8:04
n_classes should be the same as the number of outputs of the last layer. Is that right?
– saichand, Commented Aug 5, 2018 at 8:11

Mateen Ulhaq · Accepted Answer · 2022-07-16 23:16:24Z

129

This is usually an indexing issue.

For example, if your ground truth label starts at 1:

target = [1,2,3,4,5]

Then subtract 1 from every label so that:

target = [0,1,2,3,4]
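A minimal sketch of that shift, assuming the labels live in a PyTorch tensor and the model has five output classes (`n_classes` is a name chosen here for illustration):

```python
import torch

n_classes = 5  # assumed size of the model's last layer
target = torch.tensor([1, 2, 3, 4, 5])  # 1-based labels, as in the example above

# Shift to 0-based class indices.
target = target - 1

# Sanity check mirroring the device-side assertion t >= 0 && t < n_classes.
assert target.min().item() >= 0 and target.max().item() < n_classes
```

Running a check like this on the CPU before training fails with a readable Python error instead of a device-side assert.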

edited Jul 16, 2022 at 23:16

Mateen Ulhaq

27.9k21 gold badges121 silver badges155 bronze badges

answered Mar 21, 2019 at 1:08

Rainy

1,2911 gold badge8 silver badges3 bronze badges


7 Comments

Christian Over a year ago

I can confirm, this was also the cause of error in my case. For example, valid text labels have been converted to 0..n-1 (n being the number of classes). However, NaN values were converted to -1, which sent it off the rails.

Kunj Mehta Over a year ago

@Rainy can you elaborate on "ground truth label starts at 1". What do you mean by that? I gather that the labels are 1 to 5 and to overcome the error the first value in the error should be zero. Am I right?

Chandra Over a year ago

@KunjMehta, Not just first value should be zero. Class index should start from zero. e.g. for 6 classes, index values should be from 0 to 5.

Nihat Over a year ago

I get the error even though I have the setup you offer

seni Over a year ago

@Rainy My labels are [10,11,12,13,14,15,16,17,18,19]. When I give the fully connected layer 10 outputs I get the error, but with 20 outputs it runs correctly. How can I solve this?
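For labels that do not form a contiguous range starting at 0, such as 10 to 19 in the comment above, one option is to remap them to dense indices rather than enlarging the last layer. A sketch using `torch.unique`; the variable names are illustrative:

```python
import torch

raw_labels = torch.tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 10, 19])

# `classes` holds the sorted distinct labels. `dense` holds, for each element,
# the position of its label in `classes`, i.e. an index in 0..len(classes)-1.
classes, dense = torch.unique(raw_labels, return_inverse=True)

n_classes = classes.numel()  # use this as the number of outputs of the last layer

# To map a predicted index back to the original label: classes[predicted_index]
recovered = classes[dense]
```

Train on `dense` and keep `classes` around to translate predictions back to the original label values.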


roboo.jack · Accepted Answer · 2019-05-12 20:20:14Z

82

In general, when encountering CUDA runtime errors, it is advisable to run your program again with the environment variable CUDA_LAUNCH_BLOCKING=1 set to obtain an accurate stack trace.

In your specific case, the targets of your data were too high (or low) for the specified number of classes.
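If you cannot prefix the command line, for example in a notebook, the variable can also be set from Python. A sketch, assuming it runs before any CUDA work has started (setting it before importing `torch` is the conservative choice):

```python
import os

# Make CUDA kernel launches synchronous so that the Python traceback points at
# the operation that actually failed. This must happen before CUDA is initialized.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Only import torch and build the model after this point.
blocking_enabled = os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"
```

In a notebook, this belongs in the first cell; if CUDA has already been used in the session, a kernel restart is likely needed for the setting to take effect.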

edited May 12, 2019 at 20:20

roboo.jack

74 bronze badges

answered Aug 6, 2018 at 6:28

McLawrence

5,2757 gold badges42 silver badges51 bronze badges

2 Comments

Eric Wiener Over a year ago

To add to this, once you get a more accurate stack trace and locate where the issue is, you can move your tensors to CPU. Moving the tensors to CPU will give much more detailed errors. Combining CUDA_LAUNCH_BLOCKING=1 with moving the tensors to CPU was the only way I was able to solve a problem I spent 3 days on.

curiouz Over a year ago

How to run this on Kaggle kernel?

R Tiffin · Accepted Answer · 2020-09-18 08:27:40Z

19

I encountered this error when running BertModel.from_pretrained('bert-base-uncased'). I found the solution by moving to the CPU, where the error message changed to 'IndexError: index out of range in self', which led me to this post. The solution was to truncate sentences to length 512.
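The CPU error can be reproduced without loading BERT. A sketch with a plain `nn.Embedding` standing in for a position-embedding table with 512 entries (the sizes and names are assumptions for illustration, not the model's actual internals):

```python
import torch
import torch.nn as nn

max_positions = 512
position_embedding = nn.Embedding(max_positions, 8)

# A 600-token input needs position ids 0..599, but only 0..511 exist.
position_ids = torch.arange(600)

try:
    position_embedding(position_ids)
    cpu_error = None
except IndexError as exc:
    cpu_error = str(exc)  # readable on the CPU, unlike the device-side assert

# Truncating the sequence to the supported length avoids the out-of-range lookup.
truncated_ids = position_ids[:max_positions]
embedded = position_embedding(truncated_ids)
```

With a Hugging Face tokenizer, the equivalent is usually to request truncation at tokenization time rather than slicing tensors by hand.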

answered Sep 18, 2020 at 8:27

R Tiffin

3212 silver badges6 bronze badges

1 Comment

Mew Over a year ago

Good suggestion. The error I got by using CPU as device was very clear; I had written a very basic indexing bug.

hdkrgr · Accepted Answer · 2021-08-15 10:08:40Z

One way to trigger the "CUDA error: device-side assert triggered" RuntimeError is to index into a GPU torch.Tensor with a list that contains out-of-range indices.

So, this snippet would raise an IndexError with the message "IndexError: index 3 is out of bounds for dimension 0 with size 3", not the CUDA error:

import torch

data = torch.randn((3,10), device=torch.device("cuda"))
data[3,:]

whereas, this one would raise the CUDA "device-side assert triggered" RuntimeError

data = torch.randn((3,10), device=torch.device("cuda"))
indices = [1,3]
data[indices,:]

which could mean that in case of class labels, such as in the answer by @Rainy, it's the final class label (i.e. when label == num_classes) that is causing the error, when the labels start from 1 rather than 0.

Also, when device is "cpu" the error thrown is IndexError such as the one thrown by the first snippet.
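To see the readable version of the error, the same list indexing can be run on the CPU, and a small bounds check can catch the problem before it reaches the GPU. A sketch; `check_indices` is a hypothetical helper, not a PyTorch function:

```python
import torch

def check_indices(indices, size):
    """Raise IndexError if any index falls outside [0, size).

    Negative indices are rejected too, which is stricter than PyTorch itself.
    """
    bad = [i for i in indices if not 0 <= i < size]
    if bad:
        raise IndexError(f"indices {bad} out of range for dimension of size {size}")

data = torch.randn(3, 10)  # CPU tensor
indices = [1, 3]

# On the CPU, list indexing raises a plain IndexError.
try:
    data[indices, :]
    cpu_error = None
except IndexError as exc:
    cpu_error = str(exc)

# The helper reports the same problem without touching the tensor data.
try:
    check_indices(indices, data.shape[0])
    helper_error = None
except IndexError as exc:
    helper_error = str(exc)
```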

arame3333 · Accepted Answer · 2021-08-20 22:32:48Z

5

I found I got this error when I had a label with an invalid value.

answered Aug 20, 2021 at 22:32

arame3333

10.3k29 gold badges134 silver badges218 bronze badges

1 Comment

Vinod Kumar Chauhan Over a year ago

Even in my case, the issue was with the invalid value of labels as I forgot to put activation in the last layer. Thanks!

Shaina Raza · Accepted Answer · 2020-10-07 17:36:23Z

This error becomes more informative if you switch to the CPU first. On the CPU it shows the exact error, which is most likely an indexing problem; in my case it was IndexError: Target 2 is out of bounds, and yours may be similar. The question to ask is "How many classes are you currently using and what is the shape of your output?" You can find the range of your labels like this:

max(train_labels)
min(train_labels)

In my case these gave 2 and 0. The problem was caused by the missing index 1, so a quick hack is to replace all 2s with 1s, which can be done with this code:

train_=train.copy()
train_['label'] =train_['label'].replace(2,1)

Then run the same code again and check the results; it should work.

class NDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = NDataset(train_encodings, train_labels)
val_dataset = NDataset(val_encodings, val_labels)
test_dataset = NDataset(test_encodings, test_labels)
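Before and after a relabeling like the one above, it may help to confirm that every label is a valid index for the model's output. A sketch in plain Python; `out_of_range_labels` and `num_outputs` are hypothetical names, and the label list is made up to match the 0 and 2 case described:

```python
def out_of_range_labels(labels, num_outputs):
    """Return the sorted distinct labels that are not in [0, num_outputs)."""
    return sorted({y for y in labels if not 0 <= y < num_outputs})

train_labels = [0, 2, 2, 0]  # illustrative: index 1 is missing
num_outputs = 2              # assumed number of units in the last layer

before = out_of_range_labels(train_labels, num_outputs)

# The same relabeling as the pandas replace(2, 1) call, on a plain list.
remapped = [1 if y == 2 else y for y in train_labels]
after = out_of_range_labels(remapped, num_outputs)
```

An empty result means the labels fit the output layer. If the data really had three distinct classes, merging two of them would change the task, and enlarging the output layer would be the appropriate fix instead.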

Neil Murphy · Accepted Answer · 2023-05-24 16:38:52Z

3

This occurred for me when the length of the input tokens for an instance was greater than the max for the model, and when the input length was greater than the max_output_length prediction param.

answered May 24, 2023 at 16:38

Neil Murphy

675 bronze badges

2 Comments

Community Over a year ago

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.

Ahsan Nawaz Over a year ago

Thanks Murph, I have been finding solution for this problem for many days and your comment helped me fixing the issue.

Angel Salamanca · Accepted Answer · 2022-05-30 08:28:19Z

1

Another situation where this can happen: you are training on a dataset with more classes than your last layer expects. It's another unexpected-index situation.

answered May 30, 2022 at 8:28

Angel Salamanca

361 silver badge3 bronze badges


Valentin · Accepted Answer · 2022-04-29 10:32:33Z

0

This happened to me multiple times when the target or label of the BCE or CE loss was < 0.

answered Apr 29, 2022 at 10:32

Valentin

3004 silver badges14 bronze badges


user2299067 · Accepted Answer · 2022-05-11 04:14:15Z

0

This can also be caused by NaN values in your model input data. One easy way to "treat" this problem is to convert any that pop up into zeros on the fly:

batch_data[batch_data != batch_data] = 0  # NaN != NaN, so the mask selects exactly the NaN entries
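A sketch of two ways to zero out NaNs in a tensor, assuming a reasonably recent PyTorch (`torch.nan_to_num` was added in version 1.8):

```python
import torch

batch_data = torch.tensor([[1.0, float("nan")], [float("nan"), 4.0]])

# Option 1: in-place masked assignment on a copy, with an explicit NaN mask.
masked = batch_data.clone()
masked[torch.isnan(masked)] = 0.0

# Option 2: out-of-place replacement.
# Note: by default nan_to_num also replaces +/-inf with large finite values.
cleaned = torch.nan_to_num(batch_data, nan=0.0)
```

Zeroing NaNs hides the symptom; it is usually worth finding out where they come from as well.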

answered May 11, 2022 at 4:14

user2299067

10711 bronze badges


Omid Khalaf Beigi · Accepted Answer · 2022-06-17 16:27:00Z

I hope your problem has been solved. I faced this issue and spent almost 2 hours solving it, so I will explain the problem and the solution here for people in the same situation.
I had this problem because of class labels.
My project was sentiment analysis with three classes, so I labeled the dataset with the values -1, 0, 1 (with 3 nodes in the output layer), and that caused my problem.
I re-labeled the dataset with the values 0, 1, 2 and the error went away. It is important to label samples starting at 0 (PyTorch uses the class label as an index, so be careful).
If the error message tells you to set CUDA_LAUNCH_BLOCKING=1, run this before importing PyTorch: os.environ['CUDA_LAUNCH_BLOCKING'] = "1". If you still get the same error with no additional information, run the script on the CPU and try again; this time you will probably get more information about the problem.

martin36 · Accepted Answer · 2022-07-19 19:28:07Z

0

I got this error when I was using the Hugging Face Transformers model LongformerEncoderDecoder (LED) and set the decoder length too large. In my case the default maximum length for the decoder was 1024.

Hope this helps someone

answered Jul 19, 2022 at 19:28

martin36

2,3734 gold badges22 silver badges31 bronze badges


M M Kamalraj · Accepted Answer · 2024-02-10 02:21:18Z

0

Model used: distilbert-base-uncased

data used: glue / cola

issue: the test dataset in glue cola had -1 in the labels, while the validation/train datasets have {0, 1}

triangulation: collected the label values of each of the three datasets with a list comprehension and converted the result to a set to get the unique values

set([row['label'] for row in glue_cola['test']])

solution: In the transformers.Trainer() class, use the validation dataset instead of test dataset. The above error gets resolved.
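The same check can be written once for all splits. A sketch in plain Python, with small hand-made lists standing in for the real dataset (the values below are illustrative, not the actual GLUE data, and `invalid_labels_by_split` is a hypothetical helper):

```python
def invalid_labels_by_split(splits, num_classes):
    """Map each split name to its sorted labels that fall outside [0, num_classes)."""
    report = {}
    for name, labels in splits.items():
        bad = sorted({y for y in labels if not 0 <= y < num_classes})
        if bad:
            report[name] = bad
    return report

# Stand-in label columns; with a real dataset these would come from each split.
splits = {
    "train": [0, 1, 1, 0],
    "validation": [1, 0],
    "test": [-1, -1, -1],  # placeholder labels, not usable for evaluation
}

report = invalid_labels_by_split(splits, num_classes=2)
```

Any split that shows up in `report` should not be passed to a loss function as-is.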

answered Feb 10, 2024 at 2:21

M M Kamalraj

113 bronze badges


sathish kumar · Accepted Answer · 2024-05-31 12:55:09Z

0

In my case, the COCO-formatted JSON has two different labels, but the PKL file has only one label. This discrepancy causes an error. Ensure that the label counts are the same in both files to avoid this issue.

answered May 31, 2024 at 12:55

sathish kumar

262 bronze badges


utkarsh2299 · Accepted Answer · 2024-08-29 17:06:43Z

0

I was using Hugging Face to train my model when I got this error. I wanted to add more special tokens to the pretrained tokenizer and added them using the add_tokens() method. I was stuck on the issue for a few weeks before realising that I also had to resize the token embeddings matrix of the model using the resize_token_embeddings(len(tokenizer)) method.
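Why the resize matters can be illustrated with a plain `nn.Embedding`. This is a conceptual sketch, not the Hugging Face implementation; `grow_embedding` is a hypothetical helper and the sizes are made up:

```python
import torch
import torch.nn as nn

def grow_embedding(old, new_num_embeddings):
    """Return a larger embedding whose first rows are copied from `old`.

    Assumes new_num_embeddings >= old.num_embeddings.
    """
    new = nn.Embedding(new_num_embeddings, old.embedding_dim)
    with torch.no_grad():
        new.weight[: old.num_embeddings] = old.weight
    return new

vocab_size = 100
embedding = nn.Embedding(vocab_size, 16)

# Assumed: the first added token receives the next free id, i.e. vocab_size.
new_token_id = torch.tensor([vocab_size])

# Without resizing, the lookup is out of range (IndexError on the CPU,
# a device-side assert on the GPU).
try:
    embedding(new_token_id)
    lookup_failed = False
except IndexError:
    lookup_failed = True

# After growing the table by one row, the same lookup succeeds.
embedding = grow_embedding(embedding, vocab_size + 1)
vector = embedding(new_token_id)
```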

answered Aug 29, 2024 at 17:06

utkarsh2299

3413 silver badges4 bronze badges


user25428929 · Accepted Answer · 2024-06-05 09:05:03Z

-1

Target 11 is out of bounds.

def get_label(args):
    return [label.strip() for label in open(os.path.join(args.data_dir, args.label_file), "r", encoding="utf-8")]

answered Jun 5, 2024 at 9:05

user25428929

1
