This post is mostly a log of my attempts at an issue I thought I had solved (but actually hadn't); see the final question at the end.
part 1:
using torch 0.3.0 by installing torch-0.3.0.post4-cp27-cp27mu-linux_x86_64.whl
from https://download.pytorch.org/whl/cu80/torch_stable.html
Found GPU0 NVIDIA GeForce RTX 2080 Ti which requires CUDA_VERSION >= 8000 for
optimal performance and fast startup time, but your PyTorch was compiled
with CUDA_VERSION 8000. Please install the correct PyTorch binary
using instructions from http://pytorch.org
Is that a bug? It says >= 8000 and reports that the actual version is 8000, which should fit.
Well, with torch 0.3.1 it says:
/usr/local/lib/python2.7/site-packages/torch/cuda/__init__.py:95: UserWarning:
Found GPU0 NVIDIA GeForce RTX 2080 Ti which requires CUDA_VERSION >= 9000 for
optimal performance and fast startup time, but your PyTorch was compiled
with CUDA_VERSION 8000. Please install the correct PyTorch binary
using instructions from http://pytorch.org
warnings.warn(incorrect_binary_warn % (d, name, 9000, CUDA_VERSION))
So I guess it is an old bug that was since fixed.
I'm asking in the context of code that is supposed to run with torch 0.2.0; I hit issue 18622 and am trying different torch versions to see if any of them fits.
torch 0.4.0 behaves like torch 0.2.0 and torch 0.3.x: the .cuda() call takes a couple of minutes and then it outputs the error.
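If I read the warnings above correctly, the check is just a comparison between the CUDA version the wheel was compiled with and a per-architecture minimum, and the minimum table itself apparently differed between releases (0.3.0's numbers disagree with 0.3.1's). A rough sketch of that logic — the function name and the table values are my assumptions, not PyTorch's actual code; the Turing entry follows the CUDA 10 requirement quoted at the end of this post:

```python
# Sketch of the version check behind the warning above (my assumptions,
# not PyTorch's real implementation). Versions are encoded like
# CUDA_VERSION: major * 1000 + minor * 10.

# Minimum CUDA toolkit assumed per compute-capability major version.
MIN_CUDA_FOR_CC_MAJOR = {
    6: 8000,   # Pascal
    7: 10000,  # Turing -- the RTX 2080 Ti is compute capability 7.5
}

def binary_fits_gpu(compiled_cuda_version, cc_major):
    """True if a binary compiled with this CUDA version can drive the GPU."""
    return compiled_cuda_version >= MIN_CUDA_FOR_CC_MAJOR[cc_major]

print(binary_fits_gpu(8000, 7))   # cu80 wheel + RTX 2080 Ti: too old
print(binary_fits_gpu(10000, 7))  # cu100 wheel: fits
```

Under that reading, the 0.3.0 warning quoting ">= 8000" for a card that really needs a newer toolkit would just be a stale table.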
part 2:
I figured out that CUDA supports my GPU only from version 9.0 and above, so there is no workaround for that part.
It seems I should switch to a CUDA 9 container. Strangely, though, I found that with the same container, if I install any cu9 build of torch (I tried both 0.3.0 and 1.0.0), then .cuda() runs much faster — even though I never installed any CUDA 9.0 drivers in the container. The version.txt files under the CUDA path still say the version is 8, and on the host it is CUDA 11, yet print torch.version.cuda shows version 9. How is this possible, and why does it work?
root@2023553b1d7e:/# cat /usr/local/cuda/version.txt
CUDA Version 8.0.61
root@2023553b1d7e:/# cat /usr/local/cuda-8.0/version.txt
CUDA Version 8.0.61
root@2023553b1d7e:/#
There is no way CUDA 9 is in the container, because the image is:
nvidia/cuda:8.0-cudnn7-devel-ubuntu16.04
I installed torch 0.3.0 from the wheel torch-0.3.0.post4-cp27-cp27mu-linux_x86_64.whl in the cu9 storage on the PyTorch site. Does the wheel itself include somewhere the CUDA 9 it prints, i.e. is it built in? That would perhaps explain it.
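As far as I know, the pip wheels for PyTorch bundle their own CUDA runtime libraries, which would make torch.version.cuda independent of /usr/local/cuda. A quick way to check is to list the shared libraries inside the installed torch package (the lib path below is my assumption based on the cp27 install above):

```python
import glob
import os

def bundled_libs(lib_dir):
    """List the shared libraries shipped inside a torch wheel's lib dir."""
    return sorted(os.path.basename(p)
                  for p in glob.glob(os.path.join(lib_dir, "*.so*")))

# Path assumes the python2.7 site-packages install used above.
print(bundled_libs("/usr/local/lib/python2.7/site-packages/torch/lib"))
```

If a libcudart for CUDA 9 shows up in that list, the version the wheel prints is built in, and the container's /usr/local/cuda is irrelevant for the runtime libraries (though not for the driver).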
Hm, just faced an error like this during a convolution:
RuntimeError: CUDNN_STATUS_EXECUTION_FAILED
Seems like it isn't actually working without the right CUDA runtime in the container, so I'll try the matching container and update this issue.
Same error with the container 9.0-cudnn7-devel-ubuntu16.04.
This time the CUDA version is 9.0, as confirmed by inspecting the cuda and cuda-9.0 folders and their version.txt files.
The full error log is:
Traceback (most recent call last):
File "/opt/.pycharm_helpers/pydev/pydevd.py", line 1496, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "/opt/project/script_pwc.py", line 72, in <module>
flo = net(im_all)
File "/usr/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/opt/project/models/PWCNet.py", line 189, in forward
c11 = self.conv1b(self.conv1aa(self.conv1a(im1)))
File "/usr/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward
input = module(input)
File "/usr/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 277, in forward
self.padding, self.dilation, self.groups)
File "/usr/local/lib/python2.7/site-packages/torch/nn/functional.py", line 90, in conv2d
return f(input, weight, bias)
RuntimeError: CUDNN_STATUS_EXECUTION_FAILED
The torch.version.cuda is the same as the version in version.txt, which is 9.
This works with the cuda8 container and torch 0.2.0 (apart from the slow-startup problem).
Why didn't this upgrade work?
It is just convolution code that fails, nothing unique to the code itself.
Meanwhile I checked the exact cuDNN version installed in this container: it is 7.6.4, while the CUDA is 9.0.176 as said, and according to the NVIDIA site they are compatible.
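For comparing all these version strings from the version.txt files, here is a tiny helper — entirely my own sketch, not part of any tool mentioned here:

```python
def parse_version(line):
    """Turn a 'CUDA Version 9.0.176'-style line into a comparable tuple."""
    return tuple(int(part) for part in line.strip().split()[-1].split("."))

print(parse_version("CUDA Version 9.0.176"))          # (9, 0, 176)
print(parse_version("CUDA Version 9.0.176") >= (9,))  # at least CUDA 9: True
```

Tuple comparison makes "is this at least 9.0?"-style checks trivial, which is all these version.txt files are really used for here.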
Well, in conclusion: on 0.2.0 the execution of the code, in particular the forward of the convolution, works, with a slow start (about 15-20 minutes for .cuda() and one forward pass of a deep network). But the question remains: why does it fail with CUDA 9.0 and torch 0.3.0 with CUDNN_STATUS_EXECUTION_FAILED?
Just found this message from soumith in a GitHub issue:
CUDA 9 and RTX 2080 Ti simply aren't compatible and don't play well together.
An older cuDNN version working is likely a side-effect rather than expectation.
Use CUDA 10 and CUDA 10 versions of cuDNN etc. for the RTX 2080, which is Turing architecture.
Must be it.