I am running into a problem while training my network: during the first epoch the model starts training normally, but after a few iterations it throws this error:
/opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [9,0,0] Assertion t >= 0 && t < n_classes failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.c line=32 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "main.py", line 322, in <module>
    main()
  File "main.py", line 158, in main
    train(train_loader, model, optimizer, epoch, criterion)
  File "main.py", line 206, in train
    losses.update(loss.data[0], input_var.size(0))
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.c:32
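Since CUDA errors are reported asynchronously, I suspect the line shown in the traceback may not be the operation that actually failed, so one thing I plan to try (not done yet) is rerunning with blocking kernel launches to get the exact failing call:

CUDA_LAUNCH_BLOCKING=1 python main.py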
This comes from this snippet of code:
input_var = torch.autograd.Variable(input).cuda()
target_var = torch.autograd.Variable(target)
#optimizer.zero_grad()
#print(input_var.volatile)
# compute output for the number of timesteps selected by train loader
output = model.forward(x=input_var)
#print(output.volatile)
# Clean the gradient
#optimizer.zero_grad()
# Calculate the loss based on the criterion; for UCF-101 this is CrossEntropyLoss
loss = criterion(output, target_var)
#print(loss.volatile)
# measure accuracy and record loss
prec1, prec5 = accuracy(output.data, target, topk=(1, 5))
#print(loss.data.size())
losses.update(loss.data[0], input_var.size(0))
top1.update(prec1[0], input.size(0))
top5.update(prec5[0], input.size(0))
I am not sure what is happening, because the model runs for a couple of iterations before throwing this error. Could this be related to the GPU running out of memory?
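In case it helps, this is the sanity check I am planning to run over my labels, since the assert mentions t >= 0 && t < n_classes. It is only a sketch based on assumptions about my own setup (num_classes = 101 for UCF-101, and the loader yielding (input, target) batches):

num_classes = 101  # UCF-101; assumption about my setup
for i, (input, target) in enumerate(train_loader):
    # CrossEntropyLoss expects 0-indexed class labels strictly below num_classes
    if target.min() < 0 or target.max() >= num_classes:
        print('out-of-range label in batch', i, target.min(), target.max())
        break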
Thank you.