Cublas runtime error after some epochs of training and cannot resume training now

I trained my model for several epochs without errors and then suddenly got a cublas runtime error. The complete stack trace is:

Traceback (most recent call last):
  File "train_revise.py", line 405, in <module>
    main()
  File "train_revise.py", line 363, in main
    model, criterionCTC, optimizer, data, target)
  File "train_revise.py", line 144, in train_batch
    cost.backward()
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cublas runtime error : an internal operation failed at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCBlas.cu:249

If I resume training, I get this error right away, so I cannot continue training at all now.

The solution proposed in this related post does not work.

Hi,

This looks like a GPU error if restarting raises this error right away.
Are you sure that the GPU is not overheating or something like that?

Maybe. The other day, one of the GPUs in the server did not show up in nvidia-smi, and the root user rebooted the computer once. After that, I trained without errors until I ran into this issue.

How do I check whether the GPU is overheating?

nvidia-smi should give you the temperature. It should not go above 80/81°C for GTX cards; I'm not sure about server cards.
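
For example, you can poll it from the command line or from a quick Python one-off (this assumes nvidia-smi is on your PATH):

    import subprocess

    # Print the current GPU temperature in degrees Celsius via nvidia-smi.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"]
    )
    print(out.decode().strip())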

The temperature is 36°C. I changed the GPU id, but the same error persists.

Ah, OK, so it's not that. Can you run other GPU code on the machine, or does only your code fail?

I ran the code from another project, and it runs without errors on the same GPU, so maybe this is not a GPU problem. But the error message is too opaque to help me debug the code.

Can you try to save some state (like the network input and current weights) to get a reproducible example that triggers this problem, please?
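
For example, something along these lines right before the failing call (the variable names are just the ones from your stack trace; adapt them to your loop):

    import torch

    def dump_repro_state(model, optimizer, data, target, path="repro_batch.pt"):
        """Save everything needed to replay the failing batch in isolation."""
        torch.save({
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
            "batch_input": data.cpu(),
            "batch_target": target.cpu(),
        }, path)

    # Later, a small standalone script can torch.load("repro_batch.pt"),
    # restore the weights, and rerun forward/backward on just that batch.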

Finally, I have found the cause of this error. I am using the warp_ctc package for OCR recognition. The signature of CTCLoss is:

CTCLoss(size_average=False, length_average=False)
    # size_average (bool): normalize the loss by the batch size (default: False)
    # length_average (bool): normalize the loss by the total number of frames in the batch. If True, supersedes size_average (default: False)

forward(acts, labels, act_lens, label_lens)
    # acts: Tensor of (seqLength x batch x outputDim) containing output activations from network (before softmax)
    # labels: 1 dimensional Tensor containing all the targets of the batch in one large sequence
    # act_lens: Tensor of size (batch) containing size of each output sequence from the network
    # label_lens: Tensor of (batch) containing label length of each example
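
For context, here is a minimal calling sketch with dummy tensors (this assumes the SeanNaren warpctc_pytorch binding; the import name is my assumption and may differ for other builds):

    import torch
    from warpctc_pytorch import CTCLoss  # assumed import name for the warp_ctc binding

    seq_len, batch, n_classes = 50, 2, 27                              # 26 characters + index 0 for the CTC blank
    acts = torch.randn(seq_len, batch, n_classes, requires_grad=True)  # pre-softmax activations
    labels = torch.IntTensor([3, 1, 20, 8, 5, 12, 12, 15])             # "cat" + "hello" concatenated
    act_lens = torch.IntTensor([seq_len, seq_len])                     # network output length per sample
    label_lens = torch.IntTensor([3, 5])                               # target length per sample

    criterion = CTCLoss(size_average=False, length_average=False)
    cost = criterion(acts, labels, act_lens, label_lens)
    cost.backward()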

The error was caused by a bug in my code that computed the label lengths (label_lens) for a batch of labels. Once I corrected that calculation, the cublas error disappeared.
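
For reference, this is roughly what the corrected label-length computation looks like (encode_targets and char_to_idx are placeholders for my own code). The crucial invariant is that label_lens[i] is the true length of the i-th target, so the lengths sum to the size of the flattened labels tensor; when they did not, warp_ctc presumably read the wrong number of label entries, and the failure only surfaced later as the opaque cublas error in backward().

    import torch

    def encode_targets(texts, char_to_idx):
        """Flatten a batch of strings into one 1-D label tensor plus per-sample lengths."""
        labels, label_lens = [], []
        for text in texts:
            indices = [char_to_idx[c] for c in text]  # index 0 is reserved for the CTC blank
            labels.extend(indices)
            label_lens.append(len(indices))           # length of this target, not a running total
        return torch.IntTensor(labels), torch.IntTensor(label_lens)

    char_to_idx = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
    labels, label_lens = encode_targets(["cat", "hello"], char_to_idx)

    # Sanity check that would have caught the bug before it reached the GPU:
    assert int(label_lens.sum()) == labels.numel()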