RuntimeError: cuda runtime error (59) even with del loss, output

I am running a model on UCF101, and I am encountering this error after running the model for 8 iterations (though the iteration at which it stops changes):

/opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [7,0,0] Assertion t >= 0 && t < n_classes failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THCUNN/generic/ClassNLLCriterion.cu line=87 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "main.py", line 321, in <module>
    main()
  File "main.py", line 158, in main
    train(train_loader, model, optimizer, epoch, criterion)
  File "main.py", line 201, in train
    loss = criterion(output, target_var)
  File "/home/josueortc/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/josueortc/anaconda3/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 482, in forward
    self.ignore_index)
  File "/home/josueortc/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 746, in cross_entropy
    return nll_loss(log_softmax(input), target, weight, size_average, ignore_index)
  File "/home/josueortc/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 672, in nll_loss
    return _functions.thnn.NLLLoss.apply(input, target, weight, size_average, ignore_index)
  File "/home/josueortc/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/thnn/auto.py", line 47, in forward
    output, *ctx.additional_args)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THCUNN/generic/Clas

I am not sure what it means, but based on nvidia-smi the model is only occupying 2.5 GB of RAM.

Hi

I faced this problem too. Please check the output of your model to make sure that its dimension matches the number of classes.

From the error log you can see that it's not a memory error: the failed assertion t >= 0 && t < n_classes means the targets you feed to the NLL criterion are probably not in the valid range [0, n_classes - 1].
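
A minimal sketch of that check, using the output and target_var names from the traceback above (the check_targets helper is hypothetical, not part of the original code):

def check_targets(output, target_var):
    # Raise before the loss call if any label falls outside [0, n_classes - 1].
    n_classes = output.size(1)         # output shape should be (batch, n_classes)
    targets = target_var.data.cpu()    # inspect on the CPU for a readable error
    lo, hi = int(targets.min()), int(targets.max())
    assert lo >= 0, "found a negative class label (%d)" % lo
    assert hi < n_classes, "found label %d but the model only outputs %d classes" % (hi, n_classes)

Call it right before loss = criterion(output, target_var): a plain Python assert on CPU data fails with a readable message instead of a device-side assert buried in the CUDA stack.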

Thanks, I figured out the problem. For some reason (beyond me), some examples had a class label of n_classes + 1, so the error appeared whenever a batch contained one of those examples. I re-downloaded the dataset and it's working now.
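
For anyone who runs into this later, a quick way to locate the bad examples up front is to scan all labels once before training. A minimal sketch, assuming train_loader yields (input, target) batches as in the traceback and the usual 101 UCF101 classes:

n_classes = 101  # UCF101 action classes

for i, (_, target) in enumerate(train_loader):
    lo, hi = int(target.min()), int(target.max())
    if lo < 0 or hi >= n_classes:
        print("batch %d contains an out-of-range label (min=%d, max=%d)" % (i, lo, hi))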
