97
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu line=265 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "main.py", line 109, in <module>
    train(loader_train, model, criterion, optimizer)
  File "main.py", line 54, in train
    optimizer.step()
  File "/usr/local/anaconda35/lib/python3.6/site-packages/torch/optim/sgd.py", line 93, in step
    d_p.add_(weight_decay, p.data)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:265

How do I resolve this error?

6
  • 6
    Try running your script with CUDA_LAUNCH_BLOCKING=1 python your_script.py to get a more accurate stack trace. Commented Aug 5, 2018 at 7:16
  • After running with CUDA_LAUNC...=1, I get this error: /opt/conda/.../THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [0,0,0] Assertion t >= 0 && t < n_classes failed. It is repeated around 20 times, and then the traceback follows: RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu:116. How do I resolve it? Commented Aug 5, 2018 at 8:00
  • 14
    This is an error with your target labels: t >= 0 && t < n_classes. Print your labels and make sure that they are non-negative and smaller than the number of outputs of your last layer. Commented Aug 5, 2018 at 8:04
  • n_classes should be same as the output of the last layer.. Is it right? Commented Aug 5, 2018 at 8:11
  • That's right. Your targets likely take values that are too high. Commented Aug 5, 2018 at 8:16
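
The range check suggested in these comments can be sketched in a few lines. This is only an illustration and not part of the original thread: `find_invalid_targets` is a hypothetical helper name, and it assumes the labels are available as a plain list of Python integers (for a tensor, something like `targets.tolist()` would produce one).

```python
def find_invalid_targets(targets, n_classes):
    """Return the labels that violate the kernel's check 0 <= t < n_classes."""
    return [t for t in targets if not (0 <= t < n_classes)]


# Labels 1..5 with a 5-output final layer: only the label 5 is out of range.
bad = find_invalid_targets([1, 2, 3, 4, 5], n_classes=5)
print(bad)  # [5]
```

An empty result means every label satisfies the assertion from the error message; any value it returns is a label the loss kernel would reject.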

16 Answers 16

129

This is usually an indexing issue.

For example, if your ground truth label starts at 1:

target = [1,2,3,4,5]

Then you should subtract 1 from every label so that:

target = [0,1,2,3,4]
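
When the labels are not simply offset by one (for example, values such as 10 to 19), subtracting 1 is not enough, and a lookup table that maps each distinct label to a zero-based index may be easier. The following is a minimal sketch under that assumption; `build_label_map` and the sample values are hypothetical and not from the original answer.

```python
def build_label_map(labels):
    """Map each distinct label to an index in 0..n-1, in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


raw = [10, 12, 11, 10, 19]
label_map = build_label_map(raw)

# Translate the raw labels into contiguous class indices.
targets = [label_map[label] for label in raw]
print(targets)  # [0, 2, 1, 0, 3]
```

The map would normally be built once from the training labels and reused for the validation and test sets so that every split agrees on the class indices.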

7 Comments

I can confirm, this was also the cause of error in my case. For example, valid text labels have been converted to 0..n-1 (n being the number of classes). However, NaN values were converted to -1, which sent it off the rails.
@Rainy can you elaborate on "ground truth label starts at 1". What do you mean by that? I gather that the labels are 1 to 5 and to overcome the error the first value in the error should be zero. Am I right?
@KunjMehta, Not just first value should be zero. Class index should start from zero. e.g. for 6 classes, index values should be from 0 to 5.
I get the error even though I have the setup you offer
@Rainy My labels are [10,11,12,13,14,15,16,17,18,19]. When I give the final fully connected layer 10 outputs, I get the error; with 20 outputs, it runs correctly. How can I solve this issue, please?
82

In general, when encountering CUDA runtime errors, it is advisable to run your program again with the environment variable CUDA_LAUNCH_BLOCKING=1 set to obtain an accurate stack trace.

In your specific case, the targets of your data were too high (or low) for the specified number of classes.

2 Comments

To add to this, once you get a more accurate stack trace and locate where the issue is, you can move your tensors to CPU. Moving the tensors to CPU will give much more detailed errors. Combining CUDA_LAUNCH_BLOCKING=1 with moving the tensors to CPU was the only way I was able to solve a problem I spent 3 days on.
How to run this on Kaggle kernel?
19

I encountered this error when running BertModel.from_pretrained('bert-base-uncased'). I found the cause by moving to the CPU, where the error message changed to 'IndexError: index out of range in self', which led me to this post. The solution was to truncate sentences to length 512.
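
The fix described here amounts to capping each token sequence at the model's maximum length. Below is a minimal sketch on plain lists of token IDs, assuming the limit of 512 mentioned in this answer; `truncate_ids`, `MAX_LEN`, and the sample IDs are hypothetical names and data used only for illustration.

```python
MAX_LEN = 512  # limit taken from the answer above; it is model-specific


def truncate_ids(token_ids, max_len=MAX_LEN):
    """Keep at most max_len token IDs so position lookups stay in range."""
    return token_ids[:max_len]


ids = list(range(600))  # stand-in for an over-long tokenized sentence
short = truncate_ids(ids)
print(len(short))  # 512
```

Slicing a list like this drops any trailing special tokens, so in practice it is usually preferable to let the tokenizer truncate; Hugging Face tokenizers accept `truncation=True` together with `max_length` for this purpose.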

1 Comment

Good suggestion. The error I got by using CPU as device was very clear; I had written a very basic indexing bug.
12

One way to raise the "CUDA error: device-side assert triggered" RuntimeError is to index into a GPU torch.Tensor using a list that contains out-of-bounds indices.

So, this snippet would raise an IndexError with the message "IndexError: index 3 is out of bounds for dimension 0 with size 3", not the CUDA error

data = torch.randn((3,10), device=torch.device("cuda"))
data[3,:]

whereas, this one would raise the CUDA "device-side assert triggered" RuntimeError

data = torch.randn((3,10), device=torch.device("cuda"))
indices = [1,3]
data[indices,:]

which could mean that in case of class labels, such as in the answer by @Rainy, it's the final class label (i.e. when label == num_classes) that is causing the error, when the labels start from 1 rather than 0.

Also, when device is "cpu" the error thrown is IndexError such as the one thrown by the first snippet.

Comments

5

I found I got this error when I had a label with an invalid value.

1 Comment

Even in my case, the issue was with the invalid value of labels as I forgot to put activation in the last layer. Thanks!
3

This error becomes more informative if you switch to the CPU first. On the CPU it shows the exact error, which is most likely an indexing problem; in my case it was IndexError: Target 2 is out of bounds, and yours could be similar. The question to ask is "How many classes are you currently using and what is the shape of your output?" You can find the range of your class labels like this:

max(train_labels)
min(train_labels)

which in my case gave me 2 and 0. The problem is caused by a missing index 1, so a quick hack is to replace all 2s with 1s, which can be done with this code:

train_=train.copy()
train_['label'] =train_['label'].replace(2,1)

Then run the same code again and check the results; it should work.

class NDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = NDataset(train_encodings, train_labels)
val_dataset = NDataset(val_encodings, val_labels)
test_dataset = NDataset(test_encodings, test_labels)

Comments

3

This occurred for me when the length of the input tokens for an instance was greater than the max for the model, and when the input length was greater than the max_output_length prediction param.

2 Comments

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.
Thanks Murph, I had been looking for a solution to this problem for many days, and your comment helped me fix the issue.
1

Another situation where this can happen: you are training on a dataset with more classes than your last layer expects. It's another unexpected-index situation.

Comments

0

This happened to me multiple times when the target or label of the BCE or CE loss was negative (< 0).

Comments

0

This can also be caused by nan values in your model input data. One easy way to "treat" this problem is to convert any that pop up into zeros on the fly:

batch_data[batch_data != batch_data] = 0  # NaN is the only value that is not equal to itself
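
Before zeroing NaNs on the fly, it may help to find out where they come from. The sketch below locates the affected rows in plain Python, assuming the batch is available as a list of rows of floats; `rows_with_nan` and the sample batch are hypothetical and used only for illustration. For tensors, `torch.isnan` provides the equivalent element-wise test.

```python
import math


def rows_with_nan(rows):
    """Return the indices of rows that contain at least one NaN."""
    return [i for i, row in enumerate(rows) if any(math.isnan(x) for x in row)]


batch = [[0.1, 0.2], [float("nan"), 1.0], [3.0, 4.0]]
print(rows_with_nan(batch))  # [1]
```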

Comments

0

I hope your problem has been solved, but I ran into this issue and spent almost 2 hours solving it, so I will explain the problem and the solution here for people in the same situation.
I had this problem because of my class labels.
My project was sentiment analysis with three classes, so I labeled the dataset with the values -1, 0, 1 (3 nodes in the output layer), and that caused my problem!
I re-labeled the dataset with the values 0, 1, 2 and it was solved. It's important to label samples starting at 0 (PyTorch uses the class label as an index, so you should be careful).
For people who get an error message telling them to set CUDA_LAUNCH_BLOCKING=1: run this before importing PyTorch: os.environ['CUDA_LAUNCH_BLOCKING'] = "1". If you still get the same error (with no more information about it), run the script on the CPU and try again; this time you will probably get new information about the problem.
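
Setting the variable from inside the script, as described above, can be sketched as follows. This is useful in notebook environments where the launch command cannot be changed. The ordering is the only assumption worth stressing: the variable has to be in place before CUDA is initialized, and setting it before `import torch` is the cautious way to ensure that.

```python
import os

# Set this before CUDA is initialized; doing it before importing PyTorch is the safe choice.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import PyTorch only after the variable is set

print(os.environ["CUDA_LAUNCH_BLOCKING"])  # 1
```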

Comments

0

I got this error when I was using the Hugging Face Transformers model LongformerEncoderDecoder (LED) and set the decoder length too large. In my case the default maximum length for the decoder was 1024.

Hope this helps someone

Comments

0

Model used: distilbert-base-uncased

data used: glue / cola

issue: the test dataset in GLUE CoLA has -1 in the labels, while the validation/train datasets have {0, 1}

triangulation: enumerated the label values of all three datasets using a list comprehension and converted the result to a set to get the unique values

set([row['label'] for row in glue_cola['test']])

solution: In the transformers.Trainer() class, use the validation dataset instead of test dataset. The above error gets resolved.

Comments

0

In my case, the COCO-formatted JSON has two different labels, but the PKL file has only one label. This discrepancy causes an error. Ensure that the label counts are the same in both files to avoid this issue.

Comments

0

I was using Hugging Face to train my model when I got this error. I wanted to add more special tokens to the pretrained tokenizer and added them using the add_tokens() method. I was stuck on the issue for a few weeks before realising that I also had to resize the model's token embedding matrix using the resize_token_embeddings(len(tokenizer)) method.

Comments

-1

Target 11 is out of bounds.

def get_label(args):
    return [label.strip() for label in open(os.path.join(args.data_dir, args.label_file), "r", encoding="utf-8")]

Comments
