CUDA error when calling backward() on Tesla M2070

I have two systems, where the first has GeForce GTX 780 Ti with CUDA 8.0 (driver version: 375.26) and the other has Tesla M2070 with CUDA 7.5.18 (driver version: 352.99).

I installed the bleeding-edge version on both machines, on top of Python 3.6 (conda install -c soumith magma-cuda80 on the first machine and conda install -c soumith magma-cuda75 on the second).

I tested the following simple code:

import torch
from torch.autograd import Variable
a = Variable(torch.randn(3,4,5), requires_grad=True).cuda()
b = torch.randn(3,4,5).cuda()
a.backward(b)

The code works on the first machine but fails on the second with the following error:

THCudaCheck FAIL file=/users/PAS0396/osu7806/pytorch/torch/lib/THC/generic/THCTensorCopy.c line=65 error=46 : all CUDA-capable devices are busy or unavailable
Traceback (most recent call last):
  File "test.py", line 5, in <module>
    a.backward(b)
  File "/users/PAS0396/osu7806/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 146, in backward
    self._execution_engine.run_backward((self,), (gradient,), retain_variables)
  File "/users/PAS0396/osu7806/anaconda3/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py", line 163, in backward
    return grad_output.cpu()
  File "/users/PAS0396/osu7806/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 31, in cpu
    return self.type(getattr(torch, self.__class__.__name__))
  File "/users/PAS0396/osu7806/anaconda3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 276, in type
    return super(_CudaBase, self).type(*args, **kwargs)
  File "/users/PAS0396/osu7806/anaconda3/lib/python3.6/site-packages/torch/_utils.py", line 33, in _type
    return new_type(self.size()).copy_(self, async)
RuntimeError: cuda runtime error (46) : all CUDA-capable devices are busy or unavailable at /users/PAS0396/osu7806/pytorch/torch/lib/THC/generic/THCTensorCopy.c:65

Since CUDA itself is working (no problems in cuda() methods before calling backward()), I wonder why this would happen on the second system.

Hi,

PyTorch only supports compute capability >= 3.0.
Unfortunately, the Tesla M2070 is a 2.0 compute capability card.
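
If you want to check this on a given machine, something like the following might help. It is only a rough sketch: the (3, 0) threshold is taken from the statement above, is_supported and report are hypothetical helper names, and torch.cuda.get_device_capability is assumed to be available, which is true of recent PyTorch releases but may not hold for older builds.

```python
# Minimal sketch: compare each visible GPU's compute capability to a minimum.
# MIN_CAPABILITY is an assumption taken from the reply above (>= 3.0).
MIN_CAPABILITY = (3, 0)


def is_supported(capability, minimum=MIN_CAPABILITY):
    """Return True if a (major, minor) capability meets the minimum."""
    return tuple(capability) >= tuple(minimum)


def report():
    """Describe every visible CUDA device, or explain why none can be queried."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return "no usable CUDA device"
    lines = []
    for idx in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(idx)
        cap = torch.cuda.get_device_capability(idx)  # (major, minor) tuple
        status = "ok" if is_supported(cap) else "too old"
        lines.append(f"{name}: compute capability {cap[0]}.{cap[1]} ({status})")
    return "\n".join(lines)


if __name__ == "__main__":
    print(report())
```

For the two cards in this thread, is_supported((3, 5)) (GTX 780 Ti) is True and is_supported((2, 0)) (Tesla M2070) is False.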

Oh. Sorry to hear that. Torch7 worked well on that machine without any problems.
Thanks!

You might try building from source, but it would require some additional patches (look for closed issues in the main repo). We don’t officially support such cards, though.

@apaszke I built from the most recent source as follows.

export CMAKE_PREFIX_PATH=/home/kimjook/anaconda3
conda install numpy mkl setuptools cmake gcc cffi
conda install -c soumith magma-cuda75
git clone https://github.com/pytorch/pytorch
cd pytorch
pip install -r requirements.txt
python setup.py install

Could you let me know which closed issues you are referring to? https://github.com/pytorch/pytorch/issues/665 seems somewhat related but there is no build error in my case.

Thanks! (I understand that supporting old devices is annoying, but I am somewhat frustrated that an almost identical model that worked well on Torch7 doesn’t work on PyTorch.)

That’s the issue I was thinking about, but maybe you don’t need it for some reason.

I am a little confused here. The official documentation says it needs an NVIDIA GPU with compute capability >= 2.0:
http://pytorch.org/docs/master/torch.html

We should update that part. I’m doing it now.

We started with a commitment to cc >= 2.0, but it has proven infeasible: 2.0 is simply too old, and several newer APIs don’t work on it.
