I have two systems: the first has a GeForce GTX 780 Ti with CUDA 8.0 (driver version 375.26), and the other has a Tesla M2070 with CUDA 7.5.18 (driver version 352.99).
I installed the bleeding-edge version of PyTorch on both machines, on top of Python 3.6 (conda install -c soumith magma-cuda80
for the first machine and conda install -c soumith magma-cuda75
for the second machine).
I tested the following simple code:
import torch
from torch.autograd import Variable
a = Variable(torch.randn(3,4,5), requires_grad=True).cuda()
b = torch.randn(3,4,5).cuda()
a.backward(b)
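One detail I noticed while preparing this report: because cuda() is called after wrapping the tensor in a Variable, a is a non-leaf whose leaf lives on the CPU, which matches the grad_output.cpu() call in the traceback below. A variant that keeps the leaf on the GPU (just a sketch, guarded so it only runs when a CUDA device is actually available) would be:

```python
import torch
from torch.autograd import Variable

if torch.cuda.is_available():
    # Move the tensor to the GPU first, then wrap it, so the leaf
    # Variable itself lives on the GPU and backward() needs no
    # GPU -> CPU copy.
    a = Variable(torch.randn(3, 4, 5).cuda(), requires_grad=True)
    b = torch.randn(3, 4, 5).cuda()
    a.backward(b)
    # a.grad now holds b, still on the GPU.
```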
The code works on the first machine but fails on the second with the following error:
THCudaCheck FAIL file=/users/PAS0396/osu7806/pytorch/torch/lib/THC/generic/THCTensorCopy.c line=65 error=46 : all CUDA-capable devices are busy or unavailable
Traceback (most recent call last):
File "test.py", line 5, in <module>
a.backward(b)
File "/users/PAS0396/osu7806/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 146, in backward
self._execution_engine.run_backward((self,), (gradient,), retain_variables)
File "/users/PAS0396/osu7806/anaconda3/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py", line 163, in backward
return grad_output.cpu()
File "/users/PAS0396/osu7806/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 31, in cpu
return self.type(getattr(torch, self.__class__.__name__))
File "/users/PAS0396/osu7806/anaconda3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 276, in type
return super(_CudaBase, self).type(*args, **kwargs)
File "/users/PAS0396/osu7806/anaconda3/lib/python3.6/site-packages/torch/_utils.py", line 33, in _type
return new_type(self.size()).copy_(self, async)
RuntimeError: cuda runtime error (46) : all CUDA-capable devices are busy or unavailable at /users/PAS0396/osu7806/pytorch/torch/lib/THC/generic/THCTensorCopy.c:65
Since CUDA itself seems to be working (the cuda()
calls before backward()
raise no errors), I wonder why this happens on the second system.
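For reference, this is roughly the sanity check I mean by "CUDA itself is working" (a sketch; it exercises only forward CUDA ops, not autograd):

```python
import torch

# Confirm the runtime can see a device at all.
print(torch.cuda.is_available())

if torch.cuda.is_available():
    # How many devices the runtime reports as usable.
    print(torch.cuda.device_count())
    # A plain forward-only CUDA op: allocate, copy to GPU, compute.
    x = torch.randn(3, 4, 5).cuda()
    print((x + x).sum())
```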