Hi,
I've been training a GAN model in PyTorch (available here). It runs fine for ~9000 iterations, then suddenly breaks. Has anyone seen this error before?
I'm running PyTorch 0.2 on 4 NVIDIA K80 GPUs in a Docker environment (started using nvidia-docker) with Python 2.7 and the NVIDIA-Linux-x86_64-375.66 driver.
RuntimeError: cublas runtime error : an internal operation failed at /pytorch/torch/lib/
The full error trace is below.
Thanks!
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 156, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/__init__.py", line 98, in backward
    variables, grad_variables, retain_graph)
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/function.py", line 91, in apply
    return self._forward_cls.backward(self, *args)
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/_functions/blas.py", line 43, in backward
    grad_matrix1 = torch.mm(grad_output, matrix2.t())
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 579, in mm
    return Addmm.apply(output, self, matrix, 0, 1, True)
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/_functions/blas.py", line 26, in forward
    matrix1, matrix2, out=output)
RuntimeError: cublas runtime error : an internal operation failed at /pytorch/torch/lib/THC/THCBlas.cu:246
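
For anyone trying to narrow this down: CUDA kernels are launched asynchronously, so the Python frame shown in a trace like the one above is not necessarily the operation that actually failed. Below is a minimal sketch of how more context could be captured. It sets the `CUDA_LAUNCH_BLOCKING` environment variable, which makes launches synchronous, and wraps one training step so that the iteration number is attached to the error. The names `run_with_context` and `step_fn` are hypothetical helpers for illustration, not part of PyTorch or of the model's code.

```python
import os

# Must be set before the first CUDA call in the process (safest: before
# importing torch), so kernel launches run synchronously and the traceback
# points at the operation that really failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def run_with_context(step_fn, iteration, **context):
    """Call step_fn(); on RuntimeError, re-raise with iteration info appended.

    step_fn is a hypothetical zero-argument callable that runs one
    training step (forward, backward, optimizer update).
    """
    try:
        return step_fn()
    except RuntimeError as err:
        details = ", ".join(
            "%s=%r" % (key, value) for key, value in sorted(context.items())
        )
        raise RuntimeError(
            "%s (iteration %d; %s)" % (err, iteration, details)
        )
```

In a training loop this might be called as `run_with_context(lambda: train_step(batch), iteration=i, batch_size=len(batch))`, where `train_step` stands in for whatever the loop body does. Synchronous launches slow training down, so this is only meant for the run that reproduces the failure.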