Assertion `t >= 0 && t < n_classes` failed error

When I ran CUDA_LAUNCH_BLOCKING=1 python train.py, I got the following error:

/opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:106: cunn_SpatialClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [342,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:106: cunn_SpatialClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [343,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:106: cunn_SpatialClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [344,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:106: cunn_SpatialClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [345,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:106: cunn_SpatialClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [346,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:106: cunn_SpatialClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [347,0,0] Assertion t >= 0 && t < n_classes failed.
Traceback (most recent call last):
  File "train.py", line 238, in <module>
    main()
  File "train.py", line 126, in main
    train(net, optimizer)
  File "train.py", line 197, in train
    loss1 = criterion_CE(out, torch.squeeze(labels).long())
  File "/home/public/software/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/public/software/anaconda3/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 947, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/public/software/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 2422, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/public/software/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 2220, in nll_loss
    ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (710) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1595629395347/work/aten/src/THCUNN/generic/SpatialClassNLLCriterion.cu:134

Does this mean I should use nn.BCELoss rather than nn.CrossEntropyLoss here? However, I found that only CrossEntropyLoss supports ignore_index for classification, while nn.BCELoss doesn't.

Here’s my loss,

        criterion_CE = nn.CrossEntropyLoss(ignore_index=-1).cuda()
        loss = criterion_CE(out, torch.squeeze(labels).long())

The target tensor for nn.CrossEntropyLoss is expected to contain class indices in the range [0, nb_classes-1], which seems to fail in your script.
Check its values via print(target.min(), target.max()) and make sure they are valid.
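As a concrete illustration of that check (the tensor values here are made up), any index outside the valid range is what trips the device-side assert:

```python
import torch

n_classes = 5
# Hypothetical segmentation target; -1 marks pixels to ignore.
target = torch.tensor([[0, 1, 4],
                       [2, -1, 3]])

# nn.CrossEntropyLoss expects class indices in [0, n_classes - 1],
# apart from the value passed as ignore_index.
print(target.min(), target.max())  # tensor(-1) tensor(4)

# Any value >= n_classes (or a negative value other than ignore_index)
# triggers the device-side assert on the GPU.
assert target.max().item() < n_classes
```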

After running print(target.min(), target.max()), I got target.min() == 0 rather than the -1 I expected. I had set the label pixel value to -1 following your advice here: Got RuntimeError: Boolean value of Tensor with more than one value is ambiguous during training - #2 by ptrblck , but it seems this did not work as I expected. Could you please give me any clues? Many thanks.

Now I see the problem. This line of code leads to the strange behavior:

        labels1 = functional.interpolate(labels, size=24, mode='bilinear')

        print("### labels.long().min()", labels.long().min())
        print("### labels.min()", labels.min())

        print("### labels1.long().min()", labels1.long().min())
        print("### labels1.min()", labels1.min())

I got,

### labels.long().min() tensor(-1, device='cuda:2')
### labels.min() tensor(-1., device='cuda:2')
### labels1.long().min() tensor(0, device='cuda:2')
### labels1.min() tensor(0., device='cuda:2')

What’s wrong with functional.interpolate here? Thanks.

You are interpolating values using the bilinear approach and rounding afterwards, which might change the values. I’m not familiar with your use case, but as previously described, the expected target values are in [0, nb_classes-1] unless you use ignore_index for a specific index value.
Your current code crashes because the target values are not in this range.
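The effect can be reproduced on a tiny made-up label map: bilinear interpolation averages neighbouring pixels, so an isolated -1 is blended away, while 'nearest' copies exact pixel values and keeps it:

```python
import torch
import torch.nn.functional as F

# Hypothetical 1 x 1 x 4 x 4 float label map with one ignored (-1) pixel.
labels = torch.zeros(1, 1, 4, 4)
labels[0, 0, 0, 0] = -1.0

# Bilinear downsampling to 2x2 averages each 2x2 block, so the -1 pixel
# is blended with its 0-valued neighbours: (-1 + 0 + 0 + 0) / 4 = -0.25.
bilinear = F.interpolate(labels, size=2, mode='bilinear', align_corners=False)
print(bilinear.min())         # tensor(-0.2500)
print(bilinear.long().min())  # tensor(0)  -> the ignore index is gone

# 'nearest' copies an existing pixel value, so the exact -1 survives.
nearest = F.interpolate(labels, size=2, mode='nearest')
print(nearest.long().min())   # tensor(-1)
```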

Thanks for your reply. Sorry to bother you again. For the last question, I had already set ignore_index to -1:

        criterion_CE = nn.CrossEntropyLoss(ignore_index=-1).cuda()

        labels[(0.3 <= labels) & (labels <= 0.7)] = -1

        loss = criterion_CE(out, torch.squeeze(labels).long())

How could I fix this?

I’m not sure what you are trying to fix.
If you are manually setting some target indices to -1, which is invalid in the default setup, you have to use ignore_index=-1, since otherwise the expected error will be raised.
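Putting the pieces of this thread together, here is a minimal sketch of the intended setup (the 0.3/0.7 thresholds follow the snippets above; the shapes, seed, and two-class assumption are made up): mark uncertain pixels as -1, and if the labels must be resized, use mode='nearest' so the -1 markers survive and ignore_index=-1 can take effect:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_classes = 2

# Hypothetical soft labels in [0, 1]; shape N x 1 x H x W.
labels = torch.rand(1, 1, 8, 8)

# Mark uncertain pixels with the ignore value.
labels[(0.3 <= labels) & (labels <= 0.7)] = -1.0

# 'nearest' copies existing pixel values instead of blending them,
# so the -1 markers are preserved through the resize.
labels = F.interpolate(labels, size=4, mode='nearest')

# Binarize the remaining values and drop the channel dim: N x H x W.
target = torch.squeeze(labels, 1).round().long()

out = torch.randn(1, n_classes, 4, 4)  # logits: N x C x H x W
criterion_CE = nn.CrossEntropyLoss(ignore_index=-1)

loss = criterion_CE(out, target)  # pixels labelled -1 are excluded
print(loss.item())
```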