Error of cuda runtime error (59)

Wanger-SJTU · July 19, 2018, 4:57pm

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THC/generic/THCStorage.c:36

The detailed output is:

/opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [0,0,0], thread: [886,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [0,0,0], thread: [892,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [0,0,0], thread: [893,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [0,0,0], thread: [894,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [0,0,0], thread: [895,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [0,0,0], thread: [804,0,0] Assertion t >= 0 && t < n_classes failed.

main()
File "train.py", line 135, in main
train_FCN(opt)
File "train.py", line 82, in train_FCN
trainer.train()
File "/media/sjtu/831bebd9-c866-4ece-b878-5dbd68e5ca50/sjtu/CH/seg_transfer/models/trainer.py", line 240, in train
self.train_epoch()
File "/media/sjtu/831bebd9-c866-4ece-b878-5dbd68e5ca50/sjtu/CH/seg_transfer/models/trainer.py", line 169, in train_epoch
self.validate()
File "/media/sjtu/831bebd9-c866-4ece-b878-5dbd68e5ca50/sjtu/CH/seg_transfer/models/trainer.py", line 80, in validate
if np.isnan(float(loss.data.item())):
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THC/generic/THCStorage.c:36

The error actually occurs while computing the loss function, CrossEntropyLoss2d, for a segmentation task.
I checked the label images: there is no -1, and the number of output channels is correct.

Does anyone know where the problem is?

albanD · July 19, 2018, 5:31pm

Hi,

To get a better error message, you can run the same code on the CPU.
In this case, the error is that your label does not satisfy 0 <= your_label < n_classes.
If no label is negative, then at least one must be greater than or equal to your number of classes.
This is easy to check: inspect the values of your label tensor and make sure none of them is too large.
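
The range check described above could be sketched roughly as follows. This is a minimal, pure-Python illustration rather than anything from the original code: the helper name `find_invalid_labels`, the toy mask, and the use of 255 as a "void" value are all assumptions made for the example.

```python
from collections import Counter


def find_invalid_labels(values, n_classes, ignore_index=None):
    """Count label values that fall outside [0, n_classes - 1].

    `values` is any iterable of integer class ids (a flattened label mask).
    `ignore_index` is an assumed "void" id that is skipped, on the
    assumption that the loss is configured to ignore it as well.
    """
    invalid = Counter()
    for v in values:
        v = int(v)
        if ignore_index is not None and v == ignore_index:
            continue
        # Same condition as the device-side assert: 0 <= t < n_classes.
        if v < 0 or v >= n_classes:
            invalid[v] += 1
    return invalid


# Hypothetical flattened mask for a 3-class problem (valid ids are 0, 1, 2).
mask = [0, 1, 2, 255, 1, 3]
print(find_invalid_labels(mask, n_classes=3))
print(find_invalid_labels(mask, n_classes=3, ignore_index=255))
```

With a PyTorch label tensor, `label.flatten().tolist()` should give a suitable iterable. Note that a label equal to n_classes itself is already out of range, so masks numbered 1..n_classes, or masks that use a large value such as 255 for unlabeled pixels, would trip the assert even though nothing is negative.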

Wanger-SJTU · July 20, 2018, 7:37am

I have checked all the values in the labels. None is greater than the number of classes. Weird bug.

LMA · March 27, 2019, 3:58pm

Wanger-SJTU, did you resolve the problem? I have encountered a similar problem. Thanks.

Ontheroad123 · March 17, 2020, 7:42am

I ran into this error and spent a whole day checking my code, and finally found that the problem was in my val mask data. First, check all of your label images and make sure every label is in [0, n_class-1].
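
Since the traceback in the first post goes through validate(), it may help to scan every split and not only the training masks. Below is a rough sketch of such a scan, assuming the post above is about validation masks; the function name `scan_splits`, the sample ids, and the toy data are hypothetical and only illustrate the idea.

```python
def scan_splits(splits, n_classes, ignore_index=None):
    """Return (split, sample_id, min_label, max_label) for each bad mask.

    `splits` maps a split name to an iterable of (sample_id, values) pairs,
    where `values` is a flattened label mask of integer class ids.
    """
    problems = []
    for split_name, samples in splits.items():
        for sample_id, values in samples:
            kept = [
                int(v)
                for v in values
                if ignore_index is None or int(v) != ignore_index
            ]
            if not kept:
                continue  # mask contains only ignored pixels
            lo, hi = min(kept), max(kept)
            if lo < 0 or hi >= n_classes:
                problems.append((split_name, sample_id, lo, hi))
    return problems


# Hypothetical data: the training masks are fine, one validation mask is not.
splits = {
    "train": [("train_000", [0, 1, 2, 1]), ("train_001", [2, 2, 0, 0])],
    "val": [("val_000", [0, 1, 1, 0]), ("val_001", [0, 3, 1, 2])],
}
for problem in scan_splits(splits, n_classes=3):
    print(problem)
```

Reporting the split and sample id alongside the min and max label makes it easier to find the single offending file, which a check over training data alone would miss.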

Jo-w · March 30, 2022, 2:46pm

Thank you so much! That is exactly what I need.