Cublas runtime error after some epochs of training and cannot resume training now

I trained my model for several epochs without errors and then suddenly got a cublas runtime error. The complete stack trace is:

Traceback (most recent call last):
  File "train_revise.py", line 405, in <module>
    main()
  File "train_revise.py", line 363, in main
    model, criterionCTC, optimizer, data, target)
  File "train_revise.py", line 144, in train_batch
    cost.backward()
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cublas runtime error : an internal operation failed at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCBlas.cu:249

If I resume training, I get this error right away, so I cannot continue training at all now.

The solution proposed in this related post does not work.

Hi,

This looks like a GPU error if restarting raises this error right away.
Are you sure that the GPU is not overheating or something like that?

Maybe. The other day, one of the GPUs in the server did not show up in nvidia-smi, and the root user rebooted the computer once. After that, I trained without errors until I ran into this issue.

How do I check whether the GPU is overheating?

nvidia-smi should give you the temperature. It should not go above 80/81°C for GTX cards; I'm not sure about server cards.
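
For example, you can poll it from the command line or from a quick Python one-off (this assumes nvidia-smi is on your PATH):

    import subprocess

    # Print the current GPU temperature in degrees Celsius via nvidia-smi.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"]
    )
    print(out.decode().strip())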

The temperature is 36°C. I changed the GPU id, but the same error persists.

Ah, OK, so it's not that. Can you run other GPU code on the machine, or does only your code fail?

I ran the code from another project, and it runs without errors on the same GPU, so maybe this is not a GPU problem. But the error message is too opaque to help me debug the code.

Can you try to save some state (like the network input and current weights) to get a reproducible example that triggers this problem, please?
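
For example, something along these lines right before the failing call (the variable names are just the ones from your stack trace; adapt them to your loop):

    import torch

    def dump_repro_state(model, optimizer, data, target, path="repro_batch.pt"):
        """Save everything needed to replay the failing batch in isolation."""
        torch.save({
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
            "batch_input": data.cpu(),
            "batch_target": target.cpu(),
        }, path)

    # Later, a small standalone script can torch.load("repro_batch.pt"),
    # restore the weights, and rerun forward/backward on just that batch.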

Finally, I have found the cause of this error. I am using the warp_ctc package for OCR recognition. The signature of CTCLoss is:

CTCLoss(size_average=False, length_average=False)
    # size_average (bool): normalize the loss by the batch size (default: False)
    # length_average (bool): normalize the loss by the total number of frames in the batch. If True, supersedes size_average (default: False)

forward(acts, labels, act_lens, label_lens)
    # acts: Tensor of (seqLength x batch x outputDim) containing output activations from network (before softmax)
    # labels: 1 dimensional Tensor containing all the targets of the batch in one large sequence
    # act_lens: Tensor of size (batch) containing size of each output sequence from the network
    # label_lens: Tensor of (batch) containing label length of each example
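
For context, here is a minimal calling sketch with dummy tensors (this assumes the SeanNaren warpctc_pytorch binding; the import name is my assumption and may differ for other builds):

    import torch
    from warpctc_pytorch import CTCLoss  # assumed import name for the warp_ctc binding

    seq_len, batch, n_classes = 50, 2, 27                              # 26 characters + index 0 for the CTC blank
    acts = torch.randn(seq_len, batch, n_classes, requires_grad=True)  # pre-softmax activations
    labels = torch.IntTensor([3, 1, 20, 8, 5, 12, 12, 15])             # "cat" + "hello" concatenated
    act_lens = torch.IntTensor([seq_len, seq_len])                     # network output length per sample
    label_lens = torch.IntTensor([3, 5])                               # target length per sample

    criterion = CTCLoss(size_average=False, length_average=False)
    cost = criterion(acts, labels, act_lens, label_lens)
    cost.backward()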

The error was caused by a bug in my code that computed the label lengths (label_lens) for a batch of labels. Once I corrected that calculation, the cublas error disappeared.
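
For reference, this is roughly what the corrected label-length computation looks like (encode_targets and char_to_idx are placeholders for my own code). The crucial invariant is that label_lens[i] is the true length of the i-th target, so the lengths sum to the size of the flattened labels tensor; when they did not, warp_ctc presumably read the wrong number of label entries, and the failure only surfaced later as the opaque cublas error in backward().

    import torch

    def encode_targets(texts, char_to_idx):
        """Flatten a batch of strings into one 1-D label tensor plus per-sample lengths."""
        labels, label_lens = [], []
        for text in texts:
            indices = [char_to_idx[c] for c in text]  # index 0 is reserved for the CTC blank
            labels.extend(indices)
            label_lens.append(len(indices))           # length of this target, not a running total
        return torch.IntTensor(labels), torch.IntTensor(label_lens)

    char_to_idx = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
    labels, label_lens = encode_targets(["cat", "hello"], char_to_idx)

    # Sanity check that would have caught the bug before it reached the GPU:
    assert int(label_lens.sum()) == labels.numel()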