CUDNN_STATUS_NOT_INITIALIZED when installing pytorch with pip but not with conda

MiguelJaques · March 25, 2021, 1:31pm

I am using a T4 GPU from AWS (g4dn.xlarge) and I have the following driver installed:

$ nvidia-smi
Thu Mar 25 12:08:12 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   27C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

I am using conda. If I create a new environment and install PyTorch with pip install torch==1.8.0, I get a cuDNN error:

$ CUDNN_LOGINFO_DBG=1 CUDNN_LOGDEST_DBG=stdout python test.py 
cuda:0

I! CuDNN (v7605) function cudnnCreate() called:
i! Time: 2021-03-25T11:45:59.463353 (0d+0h+0m+4s since start)
i! Process=14769; Thread=14769; GPU=NULL; Handle=NULL; StreamId=NULL.

Traceback (most recent call last):
  File "test.py", line 44, in <module>
    output = net(t)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "test.py", line 20, in forward
    x = self.conv1(x)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

However, if I create another environment but instead of using pip I use conda install pytorch=1.8.0, then it works without problems.
This issue only happens for pytorch=1.8.0. If I use pip install torch==1.8.1 it works again.

I have looked at related issues on this forum, but can’t seem to figure out why conda install and pip install yield different results.

Any clue on what might be the cause? Thanks

P.S. Here is what I ran on the machine to install cuda and cudnn:

# Install cudatoolkit 10.2
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb
sudo apt-key add /var/cuda-repo-10-2-local-10.2.89-440.33.01/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda

echo "export PATH=/usr/local/cuda-10.2/bin${PATH:+:${PATH}}" >> .bashrc
echo "export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" >> .bashrc

# Install CuDNN 7 and NCCL 2
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo dpkg -i nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

sudo apt update
sudo apt install -y libcudnn7=7.6.5.32-1+cuda10.2 libcudnn7-dev=7.6.5.32-1+cuda10.2

sudo apt autoremove
sudo apt upgrade

sudo ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/local/cuda-10.2/lib64/

surya00060 · March 25, 2021, 5:31pm

Hi,

You might need to specify the CUDA version explicitly along with the PyTorch version when installing via pip, I guess. For example:

pip install torch==1.8.1+cu102

You can find the differences between the conda and pip installations on the official PyTorch page.

MiguelJaques · March 25, 2021, 5:40pm

Hey Surya,

The package torch==1.8.1+cu102 does not exist. According to the official page, the +cu suffix on the PyTorch version only seems to apply to CUDA 11 (pip install torch==1.8.1+cu111); otherwise plain pip install torch is used.
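
For reference, the part after the + is a local version label, and wheels installed with such a suffix typically report it in torch.__version__ as well. A minimal sketch for separating the release number from the build tag; split_build_tag is a hypothetical helper written for illustration, not part of PyTorch or pip:

```python
def split_build_tag(version):
    """Split a version string such as '1.8.1+cu111' into (release, build_tag).

    build_tag is None when the string carries no '+' suffix.
    Hypothetical helper for illustration; not a PyTorch or pip API.
    """
    release, sep, tag = version.partition("+")
    return release, (tag if sep else None)


print(split_build_tag("1.8.1+cu111"))  # ('1.8.1', 'cu111')
print(split_build_tag("1.8.1"))        # ('1.8.1', None)
```

Passing torch.__version__ to this function inside each environment should show whether the installed wheel carries an explicit CUDA tag.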

surya00060 · March 25, 2021, 5:54pm

Oh, I see. I’m not sure then. The error looked like a mismatch between the CUDA version the installed binary was built for and the CUDA version installed on the machine.

Maybe you can download the required whl file from here and check whether the error still occurs.
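
One way to check for the suspected mismatch is to print what the installed binary itself reports and compare the output between the pip and conda environments. This is only a sketch: describe_torch_build is a hypothetical helper name, and the import is guarded so the script still runs in an environment where torch is not installed.

```python
import importlib


def describe_torch_build():
    """Collect version details reported by the installed PyTorch binary.

    Hypothetical helper for illustration. Returns {'torch': None}
    when torch cannot be imported in the current environment.
    """
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return {"torch": None}
    return {
        "torch": torch.__version__,
        # CUDA version the binary was built with (None for CPU-only builds)
        "cuda_build": torch.version.cuda,
        # cuDNN version seen by PyTorch (None if cuDNN is unavailable)
        "cudnn": torch.backends.cudnn.version(),
        # Whether a usable GPU and driver are visible to this process
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

Running it once per environment and diffing the output should make any difference in the reported CUDA or cuDNN versions visible.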

MiguelJaques · March 25, 2021, 6:01pm

I added an edit above. This error only happens when using torch==1.8.0. Both torch==1.8.1 and torch==1.6.0 work without errors.

ptrblck · March 26, 2021, 9:01am

You are most likely running into this issue, which was solved in the 1.8.1 release.

sarahESL · May 19, 2022, 1:35pm

Had the same issue. Upgrading to torch 1.8.1 fixed it for me! Thanks.

Darshan_Gera · January 25, 2024, 2:39pm

Hi. I am running the code on 4 GPUs with distributed training with torch 1.8 and cuda 11.8.

conv_forward
return F.conv3d(
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

Same code works fine if I don’t use distributed training. Please let me know what could be the issue.

ptrblck · January 25, 2024, 2:47pm

Double post from here.