I was training my model for several epochs without errors when I suddenly got a cublas runtime error. The complete stack trace is:
Traceback (most recent call last):
  File "train_revise.py", line 405, in <module>
    main()
  File "train_revise.py", line 363, in main
    model, criterionCTC, optimizer, data, target)
  File "train_revise.py", line 144, in train_batch
    cost.backward()
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cublas runtime error : an internal operation failed at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCBlas.cu:249
If I resume training, I get this error right away, so I cannot continue training at all.
The solution proposed in this related post does not work.
Maybe. The other day, one of the GPUs in the server did not show up when running nvidia-smi, and the root user rebooted the machine once. After that, training ran without errors until I hit this issue.
I ran the code of another project on the same GPU and it works fine, so this is probably not a GPU problem. But the error message is too opaque to help debug the code.
Finally, I have found the reason for this error. I am using the warp_ctc package for OCR. The signature for CTCLoss is:
CTCLoss(size_average=False, length_average=False)
# size_average (bool): normalize the loss by the batch size (default: False)
# length_average (bool): normalize the loss by the total number of frames in the batch. If True, supersedes size_average (default: False)
forward(acts, labels, act_lens, label_lens)
# acts: Tensor of (seqLength x batch x outputDim) containing output activations from network (before softmax)
# labels: 1 dimensional Tensor containing all the targets of the batch in one large sequence
# act_lens: Tensor of size (batch) containing size of each output sequence from the network
# label_lens: Tensor of (batch) containing label length of each example
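To make the relationship between the four forward arguments concrete, here is a small sketch of how they must line up. The shapes and target values are made up for illustration, and only the input construction is shown (plain Python lists standing in for the tensors):

```python
# Sketch of building consistent CTCLoss inputs (hypothetical data).
seq_length, batch, output_dim = 50, 2, 10   # acts would be (50, 2, 10)

# Per-sample targets, e.g. character indices for two words of
# different lengths.
targets = [[3, 1, 20], [8, 9]]

# labels: all targets concatenated into one flat sequence
labels = [t for target in targets for t in target]   # [3, 1, 20, 8, 9]

# label_lens: the true length of each sample's target
label_lens = [len(target) for target in targets]     # [3, 2]

# act_lens: length of each output sequence from the network;
# here every sample uses the full sequence length
act_lens = [seq_length] * batch                      # [50, 50]

# Key invariant: the flat label sequence must contain exactly
# sum(label_lens) entries.
assert sum(label_lens) == len(labels)
```

Each of these lists would then be converted to the tensor types the package expects before calling forward.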
The error was caused by a bug in my code that miscalculated the label lengths for a batch of labels. Once I fixed that calculation, the cublas error disappeared.
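A cheap sanity check before the forward pass can surface this class of bug as a clear assertion instead of an opaque cublas error. The helper below is a hypothetical sketch (not part of warp_ctc), operating on plain Python lists:

```python
def check_ctc_inputs(labels, label_lens, act_lens, batch):
    """Validate consistency of CTC loss inputs before the forward pass."""
    assert len(label_lens) == batch, "need one label length per sample"
    assert len(act_lens) == batch, "need one activation length per sample"
    assert sum(label_lens) == len(labels), (
        "flattened labels must contain exactly sum(label_lens) entries; "
        f"got {len(labels)} labels but lengths sum to {sum(label_lens)}"
    )
    # CTC needs the output sequence to be at least as long as the target.
    for i, (al, ll) in enumerate(zip(act_lens, label_lens)):
        assert al >= ll, f"sample {i}: act_len {al} < label_len {ll}"

# Example: a buggy length calculation (e.g. using the padded length for
# every sample) fails the check immediately.
labels = [3, 1, 20, 8, 9]   # two samples: [3, 1, 20] and [8, 9]
bad_lens = [3, 3]           # bug: padded length used for sample 2
try:
    check_ctc_inputs(labels, bad_lens, [50, 50], batch=2)
except AssertionError as e:
    print("caught:", e)
```

Running the check on every batch during data loading would have pointed directly at the mismatched label lengths.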