CUDA error when calling backward() on Tesla M2070

supakjk · March 9, 2017, 3:40pm

I have two systems, where the first has GeForce GTX 780 Ti with CUDA 8.0 (driver version: 375.26) and the other has Tesla M2070 with CUDA 7.5.18 (driver version: 352.99).

I installed both with blood-edge version on top of Python 3.6. (conda install -c soumith magma-cuda80 for the first machine and conda install -c soumith magma-cuda75 for the second machine)

I tested the following simple code:

import torch
from torch.autograd import Variable
a = Variable(torch.randn(3,4,5), requires_grad=True).cuda()
b = torch.randn(3,4,5).cuda()
a.backward(b)

The code works on the first machine but failed on the other machine as follows:

THCudaCheck FAIL file=/users/PAS0396/osu7806/pytorch/torch/lib/THC/generic/THCTensorCopy.c line=65 error=46 : all CUDA-capable devices are busy or unavailable
Traceback (most recent call last):
  File "test.py", line 5, in <module>
    a.backward(b)
  File "/users/PAS0396/osu7806/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 146, in backward
    self._execution_engine.run_backward((self,), (gradient,), retain_variables)
  File "/users/PAS0396/osu7806/anaconda3/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py", line 163, in backward
    return grad_output.cpu()
  File "/users/PAS0396/osu7806/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 31, in cpu
    return self.type(getattr(torch, self.__class__.__name__))
  File "/users/PAS0396/osu7806/anaconda3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 276, in type
    return super(_CudaBase, self).type(*args, **kwargs)
  File "/users/PAS0396/osu7806/anaconda3/lib/python3.6/site-packages/torch/_utils.py", line 33, in _type
    return new_type(self.size()).copy_(self, async)
RuntimeError: cuda runtime error (46) : all CUDA-capable devices are busy or unavailable at /users/PAS0396/osu7806/pytorch/torch/lib/THC/generic/THCTensorCopy.c:65

Since CUDA itself is working (no problems in cuda() methods before calling backward()), I wonder why this would happen on the second system.

albanD · March 9, 2017, 3:46pm

Hi,

pytorch only supports compute capability >= 3.0
Unfortunately, the Tesla M2070 is a 2.0 compute capability card.

supakjk · March 9, 2017, 4:04pm

Oh. Sorry to hear that. Torch7 worked well on that machine without any problems.
Thanks!

apaszke · March 9, 2017, 7:02pm

You might try building form source but it would require some additional patches (look for closed issues in the main repo). But we don’t support them officially.

supakjk · March 9, 2017, 11:31pm

@apaszke I built from the most recent source as follows.

export CMAKE_PREFIX_PATH=/home/kimjook/anaconda3
conda install numpy mkl setuptools cmake gcc cffi
conda install -c soumith magma-cuda75
git clone https://github.com/pytorch/pytorch
cd pytorch
pip install -r requirements.txt
python setup.py install

Could you let me know which closed issues you are referring to? https://github.com/pytorch/pytorch/issues/665 seems somewhat related but there is no build error in my case.

Thanks! (I understand that supporting old devices is annoying, but I am somewhat frustrated since my almost the same model worked well on Torch7 doesn’t work on PyTorch.)

apaszke · March 10, 2017, 3:25pm

That’s the issue I was thinking about, but maybe you don’t need it for some reason.

fazlerabbitanjil · June 29, 2017, 2:51pm

I am little confused here. The official document says it needs NVIDIA GPU with compute capability >= 2.0
http://pytorch.org/docs/master/torch.html

smth · July 3, 2017, 2:07am

We should update that part. I’m doing it now.

We started with the commitment of cc >= 2.0, but it has been infeasible, as 2.0 is simply too old and several newer APIs dont work on it.