RuntimeError: CUDA error: device-side assert triggered after the first epoch

/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [3,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [7,0,0] Assertion `t >= 0 && t < n_classes` failed.

Hello, I’m getting this error after the first epoch and I can’t figure out what is wrong.

Here’s my implementation. The error is raised at running_loss += loss.item() * inputs.size(0).

Any idea what is going on?

I tried with a smaller dataset and it works flawlessly. Why does it crash with the bigger dataset?

The error message is most likely pointing to the wrong line, since CUDA operations are executed asynchronously.
However, it seems your target values are out of bounds. I assume you are using nn.CrossEntropyLoss, which expects a torch.LongTensor with values in the range [0, nb_classes-1].
Apparently, your smaller dataset does not contain any invalid labels.

You could print your current target tensor and run your code with
CUDA_LAUNCH_BLOCKING=1 python script.py args
to see which target tensor contains the wrong indices.
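
For reference, a minimal sketch of such a check (nb_classes and the example target values below are made up):

import torch

nb_classes = 10                        # hypothetical number of model outputs
targets = torch.tensor([0, 3, 9, 10])  # example batch of targets; 10 is out of bounds

invalid = (targets < 0) | (targets >= nb_classes)
if invalid.any():
    print("invalid target values:", targets[invalid].unique())

Running a check like this inside the training loop, right before the loss calculation, points to the first batch that contains out-of-bounds targets.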

I can’t believe I actually had an empty folder inside my ./train. I’m so sorry.

Thank you very much for your time.

No need to be sorry! :wink:
I’m glad it’s working now!

Hi, I’m also facing the same issue:
Here’s the line
classification_loss = F.cross_entropy(class_logits, labels)
The shapes of class_logits and labels are torch.Size([512, 12]) and torch.Size([512]).
I have 15 classes, and here are my labels:

tensor([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 12, 0, 0, 0, 0, 0, 1, 0, 0, 0, 5, 0,
0, 0, 0, 5, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 12, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 12, 1, 1, 5], device='cuda:0')

I’m not seeing any wrong labels at all.

Here’s the CUDA Error:
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [20,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu line=110 error=59 : device-side assert triggered

File "/ml/temp/object_detection/models/roi_heads.py", line 87, in fastrcnn_loss
classification_loss = F.cross_entropy(class_logits, labels)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 2056, in cross_entropy
return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 1871, in nll_loss
ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu:110

My labels start from 0. As per what you said earlier, my labels are in the range 0-14. What could be the issue? @ptrblck

Based on the shape of your output, the maximum class index should be 11, since you are dealing with 12 output classes. All label values >= 12 will yield this error, which is the case in your example.
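
A minimal sketch of such a check, using tensors with the shapes from your post (the values below are random, but include an out-of-range 12):

import torch

class_logits = torch.randn(512, 12)    # model output: [batch_size, num_classes]
labels = torch.randint(0, 13, (512,))  # label values in 0..12; 12 is out of range

num_classes = class_logits.size(1)     # 12, so valid labels are 0..11
invalid = (labels < 0) | (labels >= num_classes)
print("out-of-range label values:", labels[invalid].unique())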

Hi, I am encountering this problem too. Some classes in my dataset have only one sample, so each of them ends up either in the train set or in the test set. The labels look like this: trainset_label [0, 2, 3, 5], testset_label [1, 2, 4], and I set num_classes=6. How can I fix this problem? Thank you very much!

As long as your model output contains logits for all six classes, e.g. with the shape [batch_size, 6], there should be no problem from the point of view of your code.
That being said, I do see a problem in that some labels (class 1 and class 4) are only available in the test set, so your model won't have a chance to learn these classes.
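
For illustration, a quick way to check which classes appear only in the test split, using the label lists from your post:

trainset_label = [0, 2, 3, 5]
testset_label = [1, 2, 4]

missing_from_train = set(testset_label) - set(trainset_label)
print("classes only in the test set:", missing_from_train)  # {1, 4}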

Thank you for your reply! My dataset consists of DNA sequences. A dog's sequence may look like [ATCGATCG]: it uses a four-letter alphabet, ATCG, and the order of the sequence carries the information from which we can conclude which species it is. So it is difficult to do augmentation, not as easy as with images; if I have just one dog, I have just one DNA sequence. Thus I split the dataset randomly. If you have any good advice, please tell me. Thank you very much!

I’m not familiar with DNA sequencing, but I would try to split the dog sequence itself into a training, validation, and test part. This would make sure that each data split is drawn from approximately the same distribution.
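
A minimal sketch of what I mean, assuming the single dog sequence is one long string that can be cut into contiguous, non-overlapping chunks (the sequence and split ratios below are made up):

sequence = "ATCGATCG" * 200  # hypothetical single sequence for one class

n = len(sequence)
train_seq = sequence[: int(0.8 * n)]            # 80% for training
val_seq = sequence[int(0.8 * n): int(0.9 * n)]  # 10% for validation
test_seq = sequence[int(0.9 * n):]              # 10% for testing

This way no part of the sequence is duplicated across the splits, so there is no leakage.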

@GISKing
It seems your model output has the shape [batch_size, 5], which would mean that your target should be a LongTensor with the shape [batch_size] containing the class indices.
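
For example, a minimal sketch of matching shapes for nn.CrossEntropyLoss (the batch size below is made up):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

batch_size, num_classes = 8, 5
output = torch.randn(batch_size, num_classes)          # raw logits, shape [batch_size, 5]
target = torch.randint(0, num_classes, (batch_size,))  # LongTensor with class indices 0..4

loss = criterion(output, target)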

Thank you for your reply! If a class is a singleton, it has just one sequence, so if I want to split the data into training, validation, and test sets, I would have to copy that one sequence three times. Isn't that data leakage? The model would have already seen the 'answer'.

My labels are all right; they are 0 or 1 for 2 classes. This worked in previous runs and with other networks, but when I repeat the program, I get the same error as above.