Unsolved: NLLLoss throws Assertion `t >= 0 && t < n_classes` failed (when batch size > 1)

Hi all,

I’ve been stuck on this issue for days and have tried every suggestion I could find in the forums.
I am training a model whose output layer applies `log_softmax`, with `NLLLoss` as the loss function.
The problem:

  1. When I run the code on the CPU, everything works fine.
  2. When I run the code on the GPU with batch size == 1, everything works fine.
  3. When I run the code on the GPU with batch size > 1, I get the dreaded `CUDA error: device-side assert triggered`.

After setting the environment variable `CUDA_LAUNCH_BLOCKING=1`, I can see that the error comes from a CUDA assert:

`C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:455: block: [0,0,0], thread: [1,0,0] Assertion t >= 0 && t < n_classes failed.`
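For context, this assertion fires when a target index falls outside `[0, n_classes)`. One quick way to get a readable message instead of a device-side assert is to run the same batch through the loss on the CPU, where PyTorch raises an `IndexError` naming the offending target. A minimal sketch (the shapes and class count here are made up for illustration, not taken from my model):

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: batch of 4 samples, 5 classes
n_classes = 5
log_probs = torch.randn(4, n_classes).log_softmax(dim=1)  # stand-in model output

good_targets = torch.tensor([0, 4, 2, 1])  # all inside [0, n_classes)
bad_targets = torch.tensor([0, 5, 2, 1])   # 5 == n_classes -> out of range

loss = F.nll_loss(log_probs, good_targets)  # computes fine

caught = False
try:
    F.nll_loss(log_probs, bad_targets)  # on CPU this raises a readable IndexError
except IndexError:
    caught = True
```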

As you can see in screenshots 1 and 2, I am logging the min and max target labels to ensure that they never go below 0 or exceed `n_classes - 1`.

Any help at all is appreciated.

Here is the screenshot for when batch size == 1 on the GPU.

I get no errors on this…

Also, here is the CUDA error after setting `CUDA_LAUNCH_BLOCKING=1`:
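For anyone reproducing this: the variable only takes effect if it is set before CUDA is initialized, so I set it at the very top of the script, before importing torch (it can also be exported in the shell before launching):

```python
import os

# Must run before torch is imported, so CUDA picks it up at initialization;
# it forces kernels to launch synchronously, making the failing op visible.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```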

Could you post a minimal executable code snippet reproducing the error, please?