RuntimeError: cuda runtime error (59) even with del loss, output

I am running a model on UCF101, and I am encountering this error after running the model for 8 iterations (though the iteration at which it stops changes):

/opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [7,0,0] Assertion t >= 0 && t < n_classes failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THCUNN/generic/ClassNLLCriterion.cu line=87 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "main.py", line 321, in <module>
    main()
  File "main.py", line 158, in main
    train(train_loader, model, optimizer, epoch, criterion)
  File "main.py", line 201, in train
    loss = criterion(output, target_var)
  File "/home/josueortc/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/josueortc/anaconda3/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 482, in forward
    self.ignore_index)
  File "/home/josueortc/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 746, in cross_entropy
    return nll_loss(log_softmax(input), target, weight, size_average, ignore_index)
  File "/home/josueortc/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 672, in nll_loss
    return _functions.thnn.NLLLoss.apply(input, target, weight, size_average, ignore_index)
  File "/home/josueortc/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/thnn/auto.py", line 47, in forward
    output, *ctx.additional_args)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THCUNN/generic/Clas

I am not sure what it means, but based on nvidia-smi the model is only occupying 2.5 GB of RAM.

Hi

I faced this problem too. Please check the output of your model to make sure that its dimension matches the number of classes.

From the error log you can see that it's not a memory error: the failed assertion t >= 0 && t < n_classes means the targets you feed to the NLL criterion are probably not in the valid range [0, n_classes - 1].
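
A minimal sketch of that check, using the output and target_var names from the traceback above (the check_targets helper is hypothetical, not part of the original code):

def check_targets(output, target_var):
    # Raise before the loss call if any label falls outside [0, n_classes - 1].
    n_classes = output.size(1)         # output shape should be (batch, n_classes)
    targets = target_var.data.cpu()    # inspect on the CPU for a readable error
    lo, hi = int(targets.min()), int(targets.max())
    assert lo >= 0, "found a negative class label (%d)" % lo
    assert hi < n_classes, "found label %d but the model only outputs %d classes" % (hi, n_classes)

Call it right before loss = criterion(output, target_var): a plain Python assert on CPU data fails with a readable message instead of a device-side assert buried in the CUDA stack.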

Thanks, I figured out the problem. For some reason (beyond me), some examples had a class label of n_classes + 1, so the error appeared whenever a batch contained one of those examples. I re-downloaded the dataset and it's working now.
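
For anyone who runs into this later, a quick way to locate the bad examples up front is to scan all labels once before training. A minimal sketch, assuming train_loader yields (input, target) batches as in the traceback and the usual 101 UCF101 classes:

n_classes = 101  # UCF101 action classes

for i, (_, target) in enumerate(train_loader):
    lo, hi = int(target.min()), int(target.max())
    if lo < 0 or hi >= n_classes:
        print("batch %d contains an out-of-range label (min=%d, max=%d)" % (i, lo, hi))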
