This post is mostly a log of my attempts at an issue I thought I had solved (but actually hadn't); see the final question at the end.
part 1:
using torch 0.3.0 by installing torch-0.3.0.post4-cp27-cp27mu-linux_x86_64.whl
from https://download.pytorch.org/whl/cu80/torch_stable.html
Found GPU0 NVIDIA GeForce RTX 2080 Ti which requires CUDA_VERSION >= 8000 for
optimal performance and fast startup time, but your PyTorch was compiled
with CUDA_VERSION 8000. Please install the correct PyTorch binary
using instructions from http://pytorch.org
Is that a bug? It says >= 8000 and reports that the actual version is 8000, which should fit.
Well, with torch 0.3.1 it says:
/usr/local/lib/python2.7/site-packages/torch/cuda/__init__.py:95: UserWarning:
Found GPU0 NVIDIA GeForce RTX 2080 Ti which requires CUDA_VERSION >= 9000 for
optimal performance and fast startup time, but your PyTorch was compiled
with CUDA_VERSION 8000. Please install the correct PyTorch binary
using instructions from http://pytorch.org
warnings.warn(incorrect_binary_warn % (d, name, 9000, CUDA_VERSION))
So I guess it is an old bug that was since fixed.
I'm asking in the context of code that is supposed to run with torch 0.2.0; I hit issue 18622 and am trying different torch versions to see if any of them fits.
torch 0.4.0 behaves like torch 0.2.0 and torch 0.3.x: the .cuda() call takes a couple of minutes and then it outputs the error.
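If I read the warnings above correctly, the check is just a comparison between the CUDA version the wheel was compiled with and a per-architecture minimum, and the minimum table itself apparently differed between releases (0.3.0's numbers disagree with 0.3.1's). A rough sketch of that logic — the function name and the table values are my assumptions, not PyTorch's actual code; the Turing entry follows the CUDA 10 requirement quoted at the end of this post:

```python
# Sketch of the version check behind the warning above (my assumptions,
# not PyTorch's real implementation). Versions are encoded like
# CUDA_VERSION: major * 1000 + minor * 10.

# Minimum CUDA toolkit assumed per compute-capability major version.
MIN_CUDA_FOR_CC_MAJOR = {
    6: 8000,   # Pascal
    7: 10000,  # Turing -- the RTX 2080 Ti is compute capability 7.5
}

def binary_fits_gpu(compiled_cuda_version, cc_major):
    """True if a binary compiled with this CUDA version can drive the GPU."""
    return compiled_cuda_version >= MIN_CUDA_FOR_CC_MAJOR[cc_major]

print(binary_fits_gpu(8000, 7))   # cu80 wheel + RTX 2080 Ti: too old
print(binary_fits_gpu(10000, 7))  # cu100 wheel: fits
```

Under that reading, the 0.3.0 warning quoting ">= 8000" for a card that really needs a newer toolkit would just be a stale table.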
part 2:
I figured out that CUDA supports my GPU only from version 9.0 and above, so there is no workaround for that part.
It seems I should switch to a CUDA 9 container. Strangely, though, I found that with the same container, if I install any cu9 build of torch (I tried both 0.3.0 and 1.0.0), then .cuda() runs much faster — even though I never installed any CUDA 9.0 drivers in the container. The version.txt files under the CUDA path still say the version is 8, and on the host it is CUDA 11, yet print torch.version.cuda shows version 9. How is this possible, and why does it work?
root@2023553b1d7e:/# cat /usr/local/cuda/version.txt
CUDA Version 8.0.61
root@2023553b1d7e:/# cat /usr/local/cuda-8.0/version.txt
CUDA Version 8.0.61
root@2023553b1d7e:/#
There is no way CUDA 9 is in the container, because the image is:
nvidia/cuda:8.0-cudnn7-devel-ubuntu16.04
I installed torch 0.3.0 from the wheel torch-0.3.0.post4-cp27-cp27mu-linux_x86_64.whl in the cu9 storage on the PyTorch site. Does the wheel itself include somewhere the CUDA 9 it prints, i.e. is it built in? That would perhaps explain it.
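As far as I know, the pip wheels for PyTorch bundle their own CUDA runtime libraries, which would make torch.version.cuda independent of /usr/local/cuda. A quick way to check is to list the shared libraries inside the installed torch package (the lib path below is my assumption based on the cp27 install above):

```python
import glob
import os

def bundled_libs(lib_dir):
    """List the shared libraries shipped inside a torch wheel's lib dir."""
    return sorted(os.path.basename(p)
                  for p in glob.glob(os.path.join(lib_dir, "*.so*")))

# Path assumes the python2.7 site-packages install used above.
print(bundled_libs("/usr/local/lib/python2.7/site-packages/torch/lib"))
```

If a libcudart for CUDA 9 shows up in that list, the version the wheel prints is built in, and the container's /usr/local/cuda is irrelevant for the runtime libraries (though not for the driver).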
Hm, just faced an error like this during a convolution:
RuntimeError: CUDNN_STATUS_EXECUTION_FAILED
Seems like it isn't actually working without the right CUDA runtime in the container, so I'll try the matching container and update this issue.
Same error with the container 9.0-cudnn7-devel-ubuntu16.04.
This time the CUDA version is 9.0, as confirmed by inspecting the cuda and cuda-9.0 folders and their version.txt files.
The full error log is:
Traceback (most recent call last):
File "/opt/.pycharm_helpers/pydev/pydevd.py", line 1496, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "/opt/project/script_pwc.py", line 72, in <module>
flo = net(im_all)
File "/usr/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/opt/project/models/PWCNet.py", line 189, in forward
c11 = self.conv1b(self.conv1aa(self.conv1a(im1)))
File "/usr/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward
input = module(input)
File "/usr/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 277, in forward
self.padding, self.dilation, self.groups)
File "/usr/local/lib/python2.7/site-packages/torch/nn/functional.py", line 90, in conv2d
return f(input, weight, bias)
RuntimeError: CUDNN_STATUS_EXECUTION_FAILED
The torch.version.cuda is the same as the version in version.txt, which is 9.
This works with the cuda8 container and torch 0.2.0 (apart from the slow-startup problem).
Why didn't this upgrade work?
It is just convolution code that fails, nothing unique to the code itself.
Meanwhile I checked the exact cuDNN version installed in this container: it is 7.6.4, while the CUDA is 9.0.176 as said, and according to the NVIDIA site they are compatible.
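For comparing all these version strings from the version.txt files, here is a tiny helper — entirely my own sketch, not part of any tool mentioned here:

```python
def parse_version(line):
    """Turn a 'CUDA Version 9.0.176'-style line into a comparable tuple."""
    return tuple(int(part) for part in line.strip().split()[-1].split("."))

print(parse_version("CUDA Version 9.0.176"))          # (9, 0, 176)
print(parse_version("CUDA Version 9.0.176") >= (9,))  # at least CUDA 9: True
```

Tuple comparison makes "is this at least 9.0?"-style checks trivial, which is all these version.txt files are really used for here.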
Well, in conclusion: on 0.2.0 the execution of the code, in particular the forward of the convolution, works, with a slow start (about 15-20 minutes for .cuda() and one forward pass of a deep network). But the question remains: why does it fail with CUDA 9.0 and torch 0.3.0 with CUDNN_STATUS_EXECUTION_FAILED?
Just found this message from soumith in a GitHub issue:
CUDA 9 and RTX 2080 Ti simply aren't compatible and don't play well together.
An older cuDNN version working is likely a side-effect rather than expectation.
Use CUDA 10 and CUDA 10 versions of cuDNN etc. for the RTX 2080, which is Turing architecture.
Must be it.