Device-side assert triggered with SpatialClassNLLCriterion.cu

I ran the code with CUDA_LAUNCH_BLOCKING=1 and got the traceback below.

The PyTorch version is 1.0.1.

Traceback (most recent call last):
  File "tr.py", line 35, in train
    loss = self.loss(output, target)
  File "lib/python3.6/site-packages/torch/nn/functional.py", line 1970, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "lib/python3.6/site-packages/torch/nn/functional.py", line 1792, in nll_loss
    ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THCUNN/generic/SpatialClassNLLCriterion.cu:128

Is the code running fine on the CPU? The error message raised on the CPU might give some more information.
The current error might point to target indices outside of the valid range [0, nb_classes-1].
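
If you want a clearer message for the same failure, you could rerun just the loss computation on the CPU; here is a minimal sketch, assuming output and target are the tensors you pass to the criterion:

# Sketch: rerun the failing loss call on the CPU, where an invalid class
# index raises a readable Python error instead of a device-side assert.
import torch.nn.functional as F

output_cpu = output.detach().cpu()
target_cpu = target.detach().cpu()
loss_cpu = F.cross_entropy(output_cpu, target_cpu)  # should raise if target is out of range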

Yes, the CPU run is fine, and I will try your suggestion soon.

/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [5,0,0], thread: [232,0,0] Assertion `t >= 0 && t < n_classes` failed.

This test shows another error

This error still points to an invalid target index.
It’s a bit strange that your code is running fine on the CPU.
However, try to add an assert statement and check that each target batch only contains values in [0, nb_classes-1].
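
For example, something like this right before the loss call (just a sketch; nb_classes stands for your number of classes and is an assumption here):

# Sketch: validate each target batch before computing the loss.
# nb_classes is assumed to hold the number of classes of your model.
assert target.min().item() >= 0 and target.max().item() < nb_classes, \
    'target values outside [0, {}]: min={}, max={}'.format(
        nb_classes - 1, target.min().item(), target.max().item())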

The error appears only after some period of training, and the interval between errors is different each time.

If you are shuffling the data, it's normal that the erroneous batch shows up at different iterations.
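
To catch the bad batch whenever it shows up, you could also run the check inside the training loop and print some information before raising. A rough sketch, assuming a DataLoader called train_loader that yields (data, target) and a known nb_classes:

import torch

# Sketch: flag the offending batch before the loss call so it can be inspected,
# regardless of the iteration at which shuffling places it.
for i, (data, target) in enumerate(train_loader):
    invalid = (target < 0) | (target >= nb_classes)
    if invalid.any():
        print('iteration {}: invalid target values {}'.format(
            i, torch.unique(target[invalid]).tolist()))
        raise RuntimeError('target index out of range')
    # ... rest of the training step (forward pass, loss, backward, optimizer step)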

Thanks, I will confirm it again.

After removing the loss.backward() line, I got this error:

/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [5,0,0], thread: [179,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "tr.py", line 42, in train
    self.writer.add_scalar('loss', loss.item())
RuntimeError: CUDA error: device-side assert triggered

The code is running on the GPU.

If I use CUDA_LAUNCH_BLOCKING=1, it gives:

/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [1,0,0], thread: [350,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THCUNN/generic/SpatialClassNLLCriterion.cu line=128 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "tr.py", line 36, in train
    loss = self.loss(output, target)
  File "python3.6/site-packages/torch/nn/functional.py", line 1970, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "python3.6/site-packages/torch/nn/functional.py", line 1792, in nll_loss
    ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THCUNN/generic/SpatialClassNLLCriterion.cu:128

Any help?