cuDNN built against wrong CUDA version (10.0 instead of 9.0) when building from source -> CUDNN_STATUS_NOT_INITIALIZED

bernhardschaefer · December 3, 2019, 4:04pm

I have trouble building PyTorch with CUDA 9.0 from source.

Everything works when I install PyTorch 1.1.0 from conda with CUDA 9.0 using:

conda install pytorch==1.1.0 torchvision==0.3.0 cudatoolkit=9.0 -c pytorch

Then I tried to upgrade to PyTorch 1.3.1 by building from source, since the release only has prebuilt PyTorch binaries for CUDA 9.2.
Now I get a CUDNN_STATUS_NOT_INITIALIZED error (see “To Reproduce”).
The PyTorch environment report shows the correct CUDA version (CUDA used to build PyTorch: 9.0.176).
However, torch.__config__.show() reports CuDNN 7.4.1 (built against CUDA 10.0), which is not what I want.

Related issues on CUDNN_STATUS_NOT_INITIALIZED did not help me.

So my question is: how can I specify the target CUDA version for cuDNN when building from source? Or is there anything else that I am missing?

To Reproduce

Steps to reproduce the behavior:

Install PyTorch v1.3.1 branch from source

cd ~/path/to/pytorch
git checkout v1.3.1
git submodule sync
git submodule update --init --recursive

python setup.py clean
rm -rf ~/.nv # https://github.com/pytorch/pytorch/issues/5942

conda create -n pytorch131 python=3.7
conda activate pytorch131
conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing
conda install -c pytorch magma-cuda90

export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py install

Run the following python script

import torch
from torch import nn
print(torch.backends.cudnn.is_acceptable(torch.cuda.FloatTensor(1)))

m = nn.Conv2d(8, 13, 3, stride=2).cuda()
input = torch.randn(5, 8, 20, 30, device="cuda")
output = m(input)
print("success", output.shape)

Observe Error

True
Traceback (most recent call last):
  File "../test-cuda.py", line 13, in <module>
    output = m(input)
  File "/home/ubuntu/miniconda/envs/pytorch131/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/miniconda/envs/pytorch131/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/ubuntu/miniconda/envs/pytorch131/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

Expected behavior

Environment

PyTorch version: 1.3.0a0+ee77ccb
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CMake version: version 3.14.0

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
GPU 4: Tesla V100-SXM2-16GB
GPU 5: Tesla V100-SXM2-16GB
GPU 6: Tesla V100-SXM2-16GB
GPU 7: Tesla V100-SXM2-16GB

Nvidia driver version: 396.44
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.4.1

Versions of relevant libraries:
[pip] numpy==1.17.4
[pip] torch==1.3.0a0+ee77ccb
[conda] blas                      1.0                         mkl
[conda] magma-cuda90              2.5.0                         1    pytorch
[conda] mkl                       2019.4                      243
[conda] mkl-include               2019.4                      243
[conda] mkl-service               2.3.0            py37he904b0f_0
[conda] mkl_fft                   1.0.15           py37ha843d7b_0
[conda] mkl_random                1.1.0            py37hd6b4f25_0
[conda] torch                     1.3.0a0+ee77ccb          pypi_0    pypi

Additional context

I also printed the torch config using

import torch.__config__
print(torch.__config__.show())

Output:

PyTorch built with:
  - GCC 5.4
  - Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
  - OpenMP 201307 (a.k.a. OpenMP 4.0)
  - NNPACK is enabled
  - CUDA Runtime 9.0
  - NVCC architecture flags: -gencode;arch=compute_70,code=sm_70
  - CuDNN 7.4.1  (built against CUDA 10.0)
  - Magma 2.5.0
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math, FORCE_FALLBACK_CUDA_MPI=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
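
The mismatch is already visible in the output above: the CUDA runtime is 9.0, while cuDNN reports being built against CUDA 10.0. As a minimal sketch, the comparison could be automated by parsing that text. The helper name find_cuda_mismatch is hypothetical, and the regular expressions assume the line format shown in this post, which may differ in other PyTorch versions.

```python
import re


def find_cuda_mismatch(config_text):
    """Return (runtime_cuda, cudnn_built_against_cuda) if they differ, else None.

    Hypothetical helper: it assumes lines shaped like
    "- CUDA Runtime 9.0" and "- CuDNN 7.4.1  (built against CUDA 10.0)".
    """
    runtime = re.search(r"CUDA Runtime (\d+\.\d+)", config_text)
    cudnn = re.search(
        r"CuDNN [\d.]+\s+\(built against CUDA (\d+\.\d+)\)", config_text
    )
    if not runtime or not cudnn:
        return None  # one of the lines is missing, nothing to compare
    if runtime.group(1) != cudnn.group(1):
        return runtime.group(1), cudnn.group(1)
    return None


# Two lines copied from the output above.
sample = """\
  - CUDA Runtime 9.0
  - CuDNN 7.4.1  (built against CUDA 10.0)
"""
print(find_cuda_mismatch(sample))  # ('9.0', '10.0')
```

On a real installation the text would come from torch.__config__.show() instead of the pasted sample.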

ptrblck · December 3, 2019, 5:18pm

How did you install cudnn locally? Did you use a .deb file and make sure it’s for the right CUDA version?

bernhardschaefer · December 4, 2019, 7:15am

CUDA and cudnn were installed by an admin.
I checked the debian installation, and now this CuDNN 7.4.1 (built against CUDA 10.0) message makes total sense to me:

dpkg -l | grep cudn
ii  libcudnn7                                                  7.4.1.5-1+cuda10.0                         amd64        cuDNN runtime libraries
ii  libcudnn7-dev                                              7.4.1.5-1+cuda10.0                         amd64        cuDNN development libraries and headers

I hadn’t checked that before because I assumed there couldn’t be a mismatch, since the PyTorch 1.1.0 binaries ran on CUDA without any issues.
I installed an appropriate cudnn version and now it works, thanks a lot!

Out of curiosity: Do you know why the previous PyTorch 1.1.0 binary installation was working? Does the binary contain cudnn?
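
For anyone running the same dpkg check, here is a rough sketch that pulls the CUDA version out of the libcudnn package version strings. The function name cudnn_cuda_versions is hypothetical, and it assumes the "+cudaX.Y" suffix seen in the dpkg output above.

```python
import re


def cudnn_cuda_versions(dpkg_output):
    """Map each libcudnn package name to the CUDA version in its version string.

    Hypothetical helper: it assumes `dpkg -l` rows of the form
    "ii  <name>  <version>  <arch>  <description>" and a "+cudaX.Y" suffix.
    """
    found = {}
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Skip rows that are too short or belong to other packages.
        if len(fields) < 3 or not fields[1].startswith("libcudnn"):
            continue
        match = re.search(r"\+cuda(\d+\.\d+)", fields[2])
        if match:
            found[fields[1]] = match.group(1)
    return found


# The two rows from the dpkg output above, with whitespace shortened.
dpkg_sample = """\
ii  libcudnn7      7.4.1.5-1+cuda10.0  amd64  cuDNN runtime libraries
ii  libcudnn7-dev  7.4.1.5-1+cuda10.0  amd64  cuDNN development libraries and headers
"""
print(cudnn_cuda_versions(dpkg_sample))
# {'libcudnn7': '10.0', 'libcudnn7-dev': '10.0'}
```

Comparing these values with the CUDA version used for the build (9.0 here) shows the mismatch before starting a long source build.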

ptrblck · December 4, 2019, 7:18am

Yes, the binaries ship with CUDA, cudnn and other libraries, so that you just need the NVIDIA driver to get started.
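
To see which CUDA and cuDNN versions a given PyTorch installation reports, something like the following sketch could be used. The wrapper bundled_versions is a hypothetical name; torch.version.cuda and torch.backends.cudnn.version() are the PyTorch attributes it reads, and it returns None when torch is not importable.

```python
def bundled_versions():
    """Return the CUDA and cuDNN versions reported by the installed torch.

    Hypothetical wrapper: returns None if torch cannot be imported.
    """
    try:
        import torch
    except ImportError:
        return None
    return {
        # CUDA version torch was built with, e.g. "9.0" (None for CPU-only builds).
        "cuda": torch.version.cuda,
        # cuDNN version as an integer, or None if cuDNN is unavailable.
        "cudnn": torch.backends.cudnn.version(),
    }


print(bundled_versions())
```

With a conda or pip binary these values describe the libraries shipped inside the package, so they can differ from whatever cuDNN is installed system-wide.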