PyTorch still has so many bugs; I’m getting tired of using this framework…
The problem is that I compiled PyTorch from source and set cuDNN to my own path;
no other cuDNN or CUDA installation exists anywhere on the system.
In the build log,
the cuDNN version is 7.1.4, but when I run an LSTM, it gives the following error:
RuntimeError: cuDNN version mismatch: PyTorch was compiled against 7104 but linked against 7005.
Can anyone help?
Also, after the build finished, I got the following warning:
warning: no library file corresponding to '/mnt/lustre/sjtu/users/mkh96/tools/cuda-9.0/lib64/libcudnn.so' found (skipping)
Could you please provide the following information:
- Where is your CUDA install? Is there both a global one and a local one?
- What is the content of the lib64 folder in it (or in each install, if you have more than one)? Especially all the libcudnn* files (ls -la libcudnn*).
- What is the result of echo $LD_LIBRARY_PATH?
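If it helps, the last two questions can be answered in one go with a small script. This is only a rough sketch: the function name list_cudnn_files is made up for illustration, and it only looks at the directories named in LD_LIBRARY_PATH, not at every place the dynamic loader may search.

```python
import glob
import os


def list_cudnn_files(search_path):
    """Return (directory, filename, resolved target) for every libcudnn* entry.

    `search_path` is a colon-separated string such as $LD_LIBRARY_PATH.
    The function name is hypothetical, not part of any library.
    """
    found = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries, e.g. from a trailing ':'
        pattern = os.path.join(glob.escape(directory), "libcudnn*")
        for path in sorted(glob.glob(pattern)):
            # realpath follows symlinks, so a libcudnn.so.7 link shows
            # which fully versioned file it actually points to.
            found.append((directory, os.path.basename(path), os.path.realpath(path)))
    return found


if __name__ == "__main__":
    entries = list_cudnn_files(os.environ.get("LD_LIBRARY_PATH", ""))
    for directory, name, target in entries:
        print(f"{directory}: {name} -> {target}")
```

If this prints nothing, none of the directories in LD_LIBRARY_PATH contain a cuDNN library, and the one being loaded comes from somewhere else.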
Thanks for your advice. I checked all the environment paths, and I’m pretty sure that I used the same cuDNN directory, which is version 7.1.4; I also build other frameworks against this path.
Here is the warning I found in the log:
CMake Warning at cmake/Modules_CUDA_fix/upstream/FindCUDA.cmake:1836 (add_library):
Cannot generate a safe runtime search path for target caffe2_gpu because
files in some directories may conflict with libraries in implicit
runtime library [libcudnn.so.7] in /mnt/lustre/sjtu/users/mkh96/tools/cuda-9.0/lib64 may be hidden by files in:
Some of these libraries may not be found correctly.
Call Stack (most recent call first):
-- Generating done
Manually-specified variables were not used by the project:
The above is part of the problematic build log; /users/xxx/miniconda3/envs/pytorch/lib is my PyTorch conda directory.
I use this Python to compile the PyTorch source, but for God’s sake, I DID set CUDNN_LIB_DIR to another directory, because I have MXNet, TensorFlow, Kaldi, and many other frameworks to use, and all of them share a common environment path.
So it’s my fault: I didn’t check the log carefully, and I should not have skipped any warnings.
But I still want to point out: why not use the environment the user set, instead of searching some confusing directory?
The user-set environment should take first priority, but according to the log, the build did not honor it.
This makes it hard for us to trust your framework: we don’t know when or how errors or bugs will occur, we end up really confused, and we waste a lot of time.
I hope you will consider this.
By the way, I set
export PATH=$HOME/miniconda3/envs/pytorch/bin:$PATH before running
python setup.py install
If this folder comes first in your LD_LIBRARY_PATH, that means the conda env is active, right?
Also, aren’t you supposed to avoid adding the conda env’s bin to the PATH by hand? I don’t use conda, so I’m not sure, but from what I remember, activating the env does it for you (and handles the libraries properly as well).
CUDNN_LIB_DIR is used at compile time, which is why PyTorch was compiled against cuDNN 7.1. At runtime, cuDNN is loaded as a shared library, which reduces binary size and gives more flexibility. The warning printed at compile time is there to notify you that the cuDNN that is going to be loaded (according to LD_LIBRARY_PATH) may not be the same as the one used for compilation.
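To make the compile-time versus runtime point concrete, here is a minimal sketch of the "first hit wins" rule for LD_LIBRARY_PATH. The helper name first_match is hypothetical, and the soname libcudnn.so.7 is taken from the CMake warning earlier in the thread. The real loader also consults RPATH/RUNPATH entries embedded in the binary, the ld.so cache, and the default system directories, so this only approximates one part of the lookup.

```python
import os


def first_match(soname, search_path):
    """Return the first path in `search_path` that contains `soname`, or None.

    Directories are tried left to right, so an earlier directory hides
    a library with the same name in a later one.
    """
    for directory in search_path.split(os.pathsep):
        candidate = os.path.join(directory, soname)
        if directory and os.path.exists(candidate):
            return candidate
    return None


if __name__ == "__main__":
    print(first_match("libcudnn.so.7", os.environ.get("LD_LIBRARY_PATH", "")))
```

If the printed path is not inside the directory you passed as CUDNN_LIB_DIR, the library found at runtime is probably not the one you compiled against.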
I encountered a similar issue: “RuntimeError: cuDNN version mismatch: PyTorch was compiled against 7102 but linked against 7600”
From your discussion, my understanding is that my code was compiled for cuDNN 7.1.2 but is trying to run with cuDNN 7.6.0. (Please correct me if I am wrong.)
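For reference, the numbers in these error messages can be decoded mechanically. The sketch below assumes the encoding used by the cuDNN 7.x headers, major * 1000 + minor * 100 + patch; newer major releases may encode the version differently, so treat it as valid only for the numbers quoted in this thread. The function name decode_cudnn_version is made up for illustration.

```python
def decode_cudnn_version(code):
    """Split a cuDNN 7.x-style version integer into (major, minor, patch).

    Assumes code == major * 1000 + minor * 100 + patch.
    """
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


if __name__ == "__main__":
    # The four numbers that appear in the error messages in this thread.
    for code in (7104, 7005, 7102, 7600):
        print(code, "->", ".".join(map(str, decode_cudnn_version(code))))
```

So 7104 and 7005 are 7.1.4 and 7.0.5, and 7102 and 7600 are 7.1.2 and 7.6.0.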
I still haven’t solved this problem after reading your discussion.
May I ask for more background, such as why you asked those three questions?
In my case:
I am trying to run my code in a conda virtual env inside an Ubuntu Docker container.
In addition, I checked that I have “/usr/local/cuda”, but I am not sure whether I also have one in my local path (how can I check?). I have both “cuda” and “cuda-9.0” in the folder “/usr/local/”. What does this mean?
Content of the /usr/local/cuda/lib64 folder: “ls: cannot access ‘libcudnn*’: No such file or directory”. The same goes for cuda-9.0.
echo $LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
How do you find the build log?
I am not a conda expert, but I think the conda cudnn package is where you get your cuDNN from? Make sure that whatever that path is, it is properly referenced in your environment variables. You can also use “locate libcudnn.so” to find all the versions of cuDNN on your system. Finally, make sure to properly uninstall PyTorch before installing from source, as a binary install with an old cuDNN might be hiding your source install.
You can also search for old installations like this:
find . -name 'libcudnn*' 2>/dev/null
For example, I was getting results for both
find . -name 'libcudnn*8.0.5*' 2>/dev/null
find . -name 'libcudnn*8.1.0*' 2>/dev/null
Then I just ran
rm -R /usr/local/cuda-11.1.0 and left
11.2.0 in place, and it found the correct library.
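The same search can be scripted if you want the results grouped by version. This is a rough sketch, assuming the libraries follow the usual libcudnn*.so.X.Y.Z naming; the function name cudnn_versions_under is hypothetical, and unversioned names such as libcudnn.so are deliberately skipped.

```python
import os
import re

# Matches names such as libcudnn.so.7.1.4 or libcudnn.so.8.1.0
# and captures the three-part version at the end.
_VERSIONED = re.compile(r"^libcudnn.*\.so\.(\d+\.\d+\.\d+)$")


def cudnn_versions_under(root):
    """Map each full cuDNN version found under `root` to the directories holding it."""
    versions = {}
    # os.walk silently skips directories it cannot read,
    # much like `2>/dev/null` hides find's permission errors.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            match = _VERSIONED.match(name)
            if match:
                versions.setdefault(match.group(1), set()).add(dirpath)
    return versions


if __name__ == "__main__":
    # "." mirrors the `find .` commands above; pass another root if needed.
    for version, dirs in sorted(cudnn_versions_under(".").items()):
        print(version, sorted(dirs))
```

More than one version in the output means there are several installs that could end up being loaded.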