Cublas runtime error : an internal operation failed at /pytorch/torch/lib/

Hi,

I’ve been running a GAN model in PyTorch (available here). It ran fine for ~9000 iterations, then suddenly broke. Has anyone seen this error before?

I’m running PyTorch 0.2 on 4 NVIDIA K80 GPUs inside a Docker container (started with nvidia-docker) that has Python 2.7 and the NVIDIA-Linux-x86_64-375.66 driver.

RuntimeError: cublas runtime error : an internal operation failed at /pytorch/torch/lib/

The full error trace is below.

Thanks!

  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 156, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/__init__.py", line 98, in backward
    variables, grad_variables, retain_graph)
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/function.py", line 91, in apply
    return self._forward_cls.backward(self, *args)
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/_functions/blas.py", line 43, in backward
    grad_matrix1 = torch.mm(grad_output, matrix2.t())
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 579, in mm
    return Addmm.apply(output, self, matrix, 0, 1, True)
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/_functions/blas.py", line 26, in forward
    matrix1, matrix2, out=output)
RuntimeError: cublas runtime error : an internal operation failed at /pytorch/torch/lib/THC/THCBlas.cu:246
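
One thing that may help narrow this down: CUDA kernels are launched asynchronously, so the Python frame that reports the error is not necessarily the operation that actually failed. A minimal sketch, assuming you can edit the top of the training script, is to set the CUDA_LAUNCH_BLOCKING environment variable before anything touches the GPU. The helper name enable_blocking_launches is made up for illustration; only the environment variable itself comes from CUDA.

```python
import os


def enable_blocking_launches(env=os.environ):
    """Request synchronous kernel launches so errors surface at the failing call.

    Hypothetical helper: it only sets the CUDA_LAUNCH_BLOCKING variable.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env["CUDA_LAUNCH_BLOCKING"]


# Call this before the first CUDA call; the simplest place is
# at the very top of the script, before `import torch`.
enable_blocking_launches()
```

Setting the variable in the shell when launching the script should have the same effect. Expect training to run slower while it is on, so it is probably only worth enabling while hunting the failure.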

Try running sudo rm -rf ~/.nv to clear the NVIDIA cache directory.
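
If you would rather do the same cleanup from Python (for example at container start-up), here is a rough sketch. It assumes the cache lives in ~/.nv for the user that runs training, and the function name clear_nv_cache is just an illustration, not part of any library.

```python
import os
import shutil


def clear_nv_cache(cache_dir=None):
    """Delete the per-user NVIDIA cache directory.

    Returns True if the directory existed before the call.
    Assumption: the cache is at ~/.nv unless cache_dir is given.
    """
    if cache_dir is None:
        cache_dir = os.path.expanduser("~/.nv")
    existed = os.path.isdir(cache_dir)
    # ignore_errors avoids a crash if the directory is missing
    # or partly unreadable.
    shutil.rmtree(cache_dir, ignore_errors=True)
    return existed


# Usage (run as the same user that launches training):
#     clear_nv_cache()
```

Inside a Docker container the home directory is the container user's, so the cleanup has to happen inside the container rather than on the host.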


Thanks for the reply! I did try that based on some discussions I saw on other topics/forums.

I can’t say for sure whether it helped or not; I still got the same error thrown, albeit later in the training process.
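
Since the failure only shows up thousands of iterations in, it may be worth recording what the step was working on when it dies. Below is a minimal sketch under the assumption that each training iteration can be wrapped in a callable. guarded_step, describe, and the argument names are hypothetical, and the only thing expected of the tensors is a size() method.

```python
import sys


def describe(named_tensors):
    """Return one line per tensor with its name, size, and type name."""
    lines = []
    for name, t in sorted(named_tensors.items()):
        size = tuple(t.size())
        lines.append("%s: size=%s type=%s" % (name, size, type(t).__name__))
    return lines


def _to_stderr(message):
    sys.stderr.write(message + "\n")


def guarded_step(step_fn, iteration, named_tensors, log=_to_stderr):
    """Run one training step; on RuntimeError, log context and re-raise."""
    try:
        return step_fn()
    except RuntimeError:
        log("step failed at iteration %d" % iteration)
        for line in describe(named_tensors):
            log(line)
        raise


# Usage inside a training loop (names are placeholders):
#     guarded_step(lambda: loss.backward(), it, {"real": real, "fake": fake})
```

This does not fix anything by itself, but comparing the logged sizes across crashes might show whether the error is tied to a particular batch (for example a smaller last batch) or is random.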

Hi, I got a similar error in torch.mm. Did you get it fixed, and if so, how?

The method suggested by guoqiagn_Wei did not work for me.