I am using a T4 GPU on AWS (g4dn.xlarge) with the following driver installed:
$ nvidia-smi
Thu Mar 25 12:08:12 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   27C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
I am using conda. If I create a new environment and install PyTorch with pip install torch==1.8.0, I get a cuDNN error:
$ CUDNN_LOGINFO_DBG=1 CUDNN_LOGDEST_DBG=stdout python test.py
cuda:0
I! CuDNN (v7605) function cudnnCreate() called:
i! Time: 2021-03-25T11:45:59.463353 (0d+0h+0m+4s since start)
i! Process=14769; Thread=14769; GPU=NULL; Handle=NULL; StreamId=NULL.
Traceback (most recent call last):
  File "test.py", line 44, in <module>
    output = net(t)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "test.py", line 20, in forward
    x = self.conv1(x)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
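For anyone trying to reproduce this: test.py itself is not shown in this post, so the following is only a hypothetical stand-in that exercises the same code path as the traceback (a single Conv2d forward pass on cuda:0). The function name run_conv_once and the layer and tensor sizes are assumptions made for illustration.

```python
# Hypothetical stand-in for test.py (the real script is not shown in this post).
try:
    import torch
    import torch.nn as nn
except ImportError:  # lets the snippet load on machines without PyTorch
    torch = None


def run_conv_once():
    """Run one Conv2d forward pass on the GPU; return the output shape or None."""
    if torch is None or not torch.cuda.is_available():
        return None  # nothing to test without PyTorch and a visible GPU
    device = torch.device("cuda:0")
    print(device)
    # Layer and input sizes are arbitrary; any convolution goes through cuDNN.
    conv1 = nn.Conv2d(3, 8, kernel_size=3).to(device)
    t = torch.randn(1, 3, 32, 32, device=device)
    # The traceback above points at the convolution call as the failing step.
    output = conv1(t)
    return tuple(output.shape)


if __name__ == "__main__":
    print(run_conv_once())
```

Running it with CUDNN_LOGINFO_DBG=1 CUDNN_LOGDEST_DBG=stdout, as in the command above, should print the cuDNN log next to the result.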
However, if I create another environment and use conda install pytorch=1.8.0 instead of pip, it works without problems.
The issue only occurs with PyTorch 1.8.0: with pip install torch==1.8.1 it works again.
I have looked at related issues on this forum, but can't figure out why conda install and pip install yield different results.
Any clue as to what might be the cause? Thanks
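To compare the pip and conda environments side by side, a small report of what each PyTorch build says about itself may help. This is a minimal sketch; the function name torch_build_report is hypothetical, and it only uses attributes PyTorch exposes publicly (torch.__version__, torch.version.cuda, torch.backends.cudnn).

```python
import sys


def torch_build_report():
    """Collect the versions PyTorch reports about itself, as a plain dict."""
    report = {"python": sys.version.split()[0]}
    try:
        import torch
    except ImportError:
        report["torch"] = None  # PyTorch is not installed in this environment
        return report
    report["torch"] = torch.__version__
    report["built_with_cuda"] = torch.version.cuda  # CUDA version of the build, or None
    report["cuda_available"] = torch.cuda.is_available()
    report["cudnn_available"] = torch.backends.cudnn.is_available()
    try:
        # An integer such as 7605 for cuDNN 7.6.5, or None if cuDNN is unavailable.
        report["cudnn_version"] = torch.backends.cudnn.version()
    except RuntimeError as exc:  # raised if the cuDNN versions do not match
        report["cudnn_version"] = "error: %s" % exc
    return report


if __name__ == "__main__":
    for key, value in torch_build_report().items():
        print("%s: %s" % (key, value))
```

Running the script once in each environment and diffing the output would show whether the two installs differ in their CUDA or cuDNN versions.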
P.S. Here is what I ran on the machine to install CUDA and cuDNN:
# Install cudatoolkit 10.2
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb
sudo apt-key add /var/cuda-repo-10-2-local-10.2.89-440.33.01/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
echo "export PATH=/usr/local/cuda-10.2/bin${PATH:+:${PATH}}" >> .bashrc
echo "export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" >> .bashrc
# Install cuDNN 7 from the NVIDIA machine-learning repo (no NCCL package is installed by the commands below)
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo dpkg -i nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt update
sudo apt install -y libcudnn7=7.6.5.32-1+cuda10.2 libcudnn7-dev=7.6.5.32-1+cuda10.2
sudo apt autoremove
sudo apt upgrade
sudo ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/local/cuda-10.2/lib64/
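Since the steps above put a system-wide libcudnn.so.7 on LD_LIBRARY_PATH, one thing that might be worth checking is which cuDNN shared object, if any, each environment actually loads. The sketch below reads /proc/self/maps (Linux only); the helper name loaded_libraries is hypothetical. An empty result does not prove anything on its own, because cuDNN may also be linked statically into PyTorch's own libraries.

```python
import os


def loaded_libraries(pattern):
    """Return sorted paths of files mapped into this process whose file name
    contains `pattern` (Linux only; reads /proc/self/maps)."""
    maps_path = "/proc/self/maps"
    if not os.path.exists(maps_path):
        return []  # not on Linux, or /proc is not mounted
    found = set()
    with open(maps_path) as maps:
        for line in maps:
            # Fields: address, perms, offset, dev, inode, then an optional path.
            fields = line.split(None, 5)
            if len(fields) == 6:
                path = fields[5].strip()
                if pattern in os.path.basename(path):
                    found.add(path)
    return sorted(found)


if __name__ == "__main__":
    try:
        import torch
        if torch.cuda.is_available():
            torch.zeros(1).cuda()  # touch the GPU so the CUDA libraries get loaded
    except ImportError:
        pass
    print(loaded_libraries("cudnn"))
```

Libraries that are loaded lazily only appear after they are first used, so it is best to call loaded_libraries right after the failing operation (or in an except block around it).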