RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

Hi, I am sorry for repeating this issue, which has been posted here many times before. I am getting the following error:

File "/home/sd/anaconda3/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

I have tried the previous answers here: RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED - #2 by ptrblck. But when trying to install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0, I get the following error:

ERROR: No matching distribution found for torch==1.8.0+cu111

To make sure that I am using compatible versions of all the packages, I am listing them below.

python: 3.10.9
cuda compilation tools: 10.1
torch: 2.0.0+cu117
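
For reference, here is a quick sanity-check snippet (just a sketch) that prints the CUDA and cuDNN versions the installed PyTorch binary itself reports; note that the pip/conda binaries ship their own CUDA runtime, so the local nvcc 10.1 does not have to match:

import torch

# Versions the installed PyTorch binary was built against / actually uses.
print("torch:", torch.__version__)                 # e.g. 2.0.0+cu117
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())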

Also, here are the GPU details:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN RTX    Off  | 00000000:03:00.0 Off |                  N/A |
| 41%   41C    P8    24W / 280W |      0MiB / 24217MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro K620         Off  | 00000000:A1:00.0 Off |                  N/A |
| 48%   59C    P0     3W /  30W |    450MiB /  2002MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A    151344      C   python                          447MiB   |
+-----------------------------------------------------------------------------+

Is there anything wrong with these versions that could be causing this error? Any help is very much appreciated.

Could you post a minimal and executable code snippet reproducing the issue, please?
Also, it seems you are using your Quadro K620, which has only ~2GB of memory, instead of the TITAN RTX with ~24GB. In this case you could easily run out of memory, which could also raise this error message if cuDNN fails to initialize its handle.
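
Something along these lines (a rough sketch, not tested on your exact setup) would show the free memory on each device and whether a tiny cuDNN convolution can run on it:

import torch
import torch.nn.functional as F

# Rough check: report free/total memory per GPU and try a tiny conv,
# which is enough to force cuDNN handle creation on that device.
for idx in range(torch.cuda.device_count()):
    device = torch.device(f"cuda:{idx}")
    free, total = torch.cuda.mem_get_info(device)
    print(f"{idx}: {torch.cuda.get_device_name(device)}, "
          f"{free / 1024**2:.0f}MiB free / {total / 1024**2:.0f}MiB total")
    try:
        x = torch.randn(1, 3, 8, 8, device=device)
        w = torch.randn(4, 3, 3, 3, device=device)
        F.conv2d(x, w)
        print("  conv2d OK")
    except RuntimeError as e:
        print(f"  conv2d failed: {e}")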

Hi, thanks for the reply. I do not have a minimal code snippet reproducing the error, but I get the error just by running the exact code from here: Transfer Learning for Computer Vision Tutorial — PyTorch Tutorials 2.0.0+cu117 documentation
Maybe you could just copy the code and run it as a check, if that's not a problem.

The code works for me using torch==2.0.0+cu118 on a 3090, and I still think you might be running out of memory on the K620, as nvidia-smi also indicates a Python process is running on this GPU.

Hi, the code has the following line:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

which means the TITAN RTX will be used, doesn't it?

However, I have changed it to "cuda:1" and the error still appears.
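
For reference, one way to make sure only one physical GPU is visible to PyTorch is to set CUDA_VISIBLE_DEVICES before CUDA is initialized; note that CUDA's default enumeration order does not have to match the nvidia-smi order unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set. A minimal sketch:

import os

# Make only the TITAN RTX visible to CUDA. This must run before any CUDA
# initialization (i.e. before the first torch.cuda call). With PCI_BUS_ID the
# index matches the nvidia-smi output above, where device 0 is the TITAN RTX.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(torch.cuda.get_device_name(device))  # should report the TITAN RTX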

Also, conda list gives me the following packages:

nvidia-cublas-cu11        11.10.3.66               pypi_0    pypi
nvidia-cublas-cu12        12.1.0.26                pypi_0    pypi
nvidia-cuda-cupti-cu11    11.7.101                 pypi_0    pypi
nvidia-cuda-nvrtc-cu11    11.7.99                  pypi_0    pypi
nvidia-cuda-runtime-cu11  11.7.99                  pypi_0    pypi
nvidia-cuda-runtime-cu12  12.1.55                  pypi_0    pypi
nvidia-cudnn-cu11         8.5.0.96                 pypi_0    pypi
nvidia-cudnn-cu12         8.9.0.131                pypi_0    pypi
nvidia-cufft-cu11         10.9.0.58                pypi_0    pypi
nvidia-curand-cu11        10.2.10.91               pypi_0    pypi
nvidia-cusolver-cu11      11.4.0.1                 pypi_0    pypi
nvidia-cusparse-cu11      11.7.4.91                pypi_0    pypi
nvidia-nccl-cu11          2.14.3                   pypi_0    pypi
nvidia-nvtx-cu11          11.7.91                  pypi_0    pypi

Is the error occurring because two different CUDA versions are present?

Yes, this could be the case. How did you install PyTorch, and did you manually install any nvidia-* packages? Note that this use case is not supported, as it can easily break your environment, especially since you are now mixing libraries coming from two different CUDA major releases (11 vs. 12).
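
If it helps, one way to confirm which libcudnn the process actually loads on Linux is to inspect /proc/self/maps after forcing a cuDNN call (a rough sketch, not specific to your setup):

import torch
import torch.nn.functional as F

# Force cuDNN to be loaded (the conv may still fail with the reported error),
# then list the libcudnn shared objects this process has actually mapped.
try:
    x = torch.randn(1, 3, 8, 8, device="cuda")
    w = torch.randn(4, 3, 3, 3, device="cuda")
    F.conv2d(x, w)
except RuntimeError as e:
    print("conv2d failed:", e)

with open("/proc/self/maps") as f:  # Linux only
    libs = {line.split()[-1] for line in f if "libcudnn" in line}
print("\n".join(sorted(libs)) or "no libcudnn mapped")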

Actually, when I got access to the GPU, PyTorch was already installed on it. I think I may have accidentally installed other versions of the nvidia-* packages while installing some other package. Could you please suggest which of the above packages I should remove?

I would uninstall all PyTorch and nvidia-* packages and install a single binary with the desired CUDA version. Alternatively, you could also create a new and empty virtual environment and install PyTorch there.

Thank you, I installed PyTorch in a new environment and it works now.

Good to hear it’s working now and thanks for the update.