Could not load library libcudnn_cnn_train.so.8. But I'm sure that I have set the right LD_LIBRARY_PATH

Hello, I am learning PyTorch and have some code to run on Fashion MNIST. But when I ran this code:

def train(epoch):
    model.train()
    train_loss = 0
    for data, label in train_loader:
        data, label = data.cuda(), label.cuda()
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, label)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()*data.size(0)
    train_loss = train_loss/len(train_loader.dataset)
    print('Epoch: {} \tTraining Loss: {:.6f}'.format(epoch, train_loss))

for epoch in range(1, epochs+1):
    train(epoch)

Something went wrong:

Could not load library libcudnn_cnn_train.so.8. Error: /usr/local/cuda-12.1/lib64/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[17], line 2
      1 for epoch in range(1, epochs+1):
----> 2     train(epoch)
      3     val(epoch)

Cell In[15], line 9, in train(epoch)
      7 output = model(data)
      8 loss = criterion(output, label)
----> 9 loss.backward()
     10 optimizer.step()
     11 train_loss += loss.item()*data.size(0)

File ~/app/anaconda3/envs/pytorch-cuda12.1/lib/python3.10/site-packages/torch/_tensor.py:492, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    482 if has_torch_function_unary(self):
    483     return handle_torch_function(
    484         Tensor.backward,
    485         (self,),
   (...)
    490         inputs=inputs,
    491     )
--> 492 torch.autograd.backward(
    493     self, gradient, retain_graph, create_graph, inputs=inputs
    494 )

File ~/app/anaconda3/envs/pytorch-cuda12.1/lib/python3.10/site-packages/torch/autograd/__init__.py:251, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    246     retain_graph = create_graph
    248 # The reason we repeat the same comment below is that
    249 # some Python versions print out the first line of a multi-line function
    250 # calls in the traceback and some print out the last line
--> 251 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    252     tensors,
    253     grad_tensors_,
    254     retain_graph,
    255     create_graph,
    256     inputs,
    257     allow_unreachable=True,
    258     accumulate_grad=True,
    259 )

RuntimeError: GET was unable to find an engine to execute this computation

My setup is:

  • Ubuntu 22.04
  • 4 × NVIDIA GTX 1080 Ti
  • NVIDIA driver version: 535.104.05
  • CUDA version (nvcc -V): 12.1
  • torch version: 2.1.0
    • torchaudio version: 2.1.0
    • torchvision version: 0.16.0
  • Python version: 3.10.12

I am sure I have installed cuDNN, and I set the right LD_LIBRARY_PATH in my .bashrc:

# cuda version change
export PATH=/usr/local/cuda-12.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH

Moreover, I can see libcudnn_cnn_train.so.8 right there in /usr/local/cuda-12.1/lib64, so why can’t PyTorch load it?

To rule out having installed the CPU-only build of PyTorch, I ran a quick test, and it showed no problem:

import torch
print(torch.cuda.is_available())
print(torch.__version__)
print(torch.__path__)
print(torch.version.cuda)

# GPU
x = torch.randn(1, 3, 224, 224).cuda()
conv = torch.nn.Conv2d(3, 3, 3).cuda()

out = conv(x)
print(out.sum())
torch.backends.cudnn.version()

outcome:

True
2.1.0+cu121
['/home/pku/app/anaconda3/envs/pytorch-cuda12.1/lib/python3.10/site-packages/torch']
12.1
tensor(6512.6465, device='cuda:0', grad_fn=<SumBackward0>)
8904

Can you please help me with my problem?

The PyTorch binaries ship with their own CUDA dependencies (including cuDNN), so remove your locally installed cuDNN (temporarily) from the library path and let PyTorch load its own libs.
If you want to use your locally installed CUDA toolkit you could build PyTorch from source.
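
To check whether a locally installed cuDNN is visible to the dynamic loader in the first place, something like the following can help. It is only a sketch: the function name `find_cudnn_on_library_path` is made up for illustration, and it assumes a Linux-style LD_LIBRARY_PATH and the usual `libcudnn*.so*` file naming.

```python
import glob
import os


def find_cudnn_on_library_path(ld_library_path):
    """Return {directory: [libcudnn file names]} for each path entry that holds cuDNN."""
    hits = {}
    for directory in ld_library_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries, e.g. from a trailing separator
        matches = sorted(glob.glob(os.path.join(directory, "libcudnn*.so*")))
        if matches:
            hits[directory] = [os.path.basename(m) for m in matches]
    return hits


if __name__ == "__main__":
    found = find_cudnn_on_library_path(os.environ.get("LD_LIBRARY_PATH", ""))
    if not found:
        print("No libcudnn*.so* files found on LD_LIBRARY_PATH.")
    for directory, names in found.items():
        print(f"{directory}: {', '.join(names)}")
```

Any directory it reports is a candidate for the kind of conflict described in this thread.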


Thank you! It worked! Now I can run my code!
Let me describe how I followed your advice to solve this problem:

cd /usr/local/cuda-12.1/lib64
sudo rm -f libcudnn*
cd /usr/local/cuda-12.1/include
sudo rm -f cudnn*

Then I ran my code, and it worked!

To make things clearer, I ran a test:

import torch
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.version.cuda)
print(torch.backends.cudnn.version())

Before I removed the local cuDNN as you suggested, the output was:

2.1.0+cu121
True
12.1
8904

After I removed the local cuDNN:

2.1.0+cu121
True
12.1
8902

I noticed the cuDNN version changed from 8904 to 8902. Great, this means the PyTorch binaries do ship cuDNN as one of their dependencies.
Moreover, I changed the CUDA version to 12.2 in ~/.bashrc:

# cuda version change
export PATH=/usr/local/cuda-12.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH

then

source ~/.bashrc

Then I ran the code and got:

2.1.0+cu121
True
12.1
8902

As we can see, the CUDA version is still 12.1, not 12.2.
Amazing! The PyTorch binaries also ship the CUDA runtime as a dependency.

I even commented out the lines in ~/.bashrc:

# cuda version change
# export PATH=/usr/local/cuda-12.2/bin:$PATH
# export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH

And I can still run my training code. Awesome, this means PyTorch doesn’t use my local CUDA and cuDNN to run its code. It uses its own CUDA and cuDNN dependencies, which are not affected by my local installation.
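
As a side note on the two numbers printed by torch.backends.cudnn.version() earlier in this post (8904 and 8902): for the cuDNN 8.x series the integer is packed as major*1000 + minor*100 + patch, so it can be decoded with a small helper. The function name below is made up for illustration, and the formula is only meant for cuDNN 8.x values.

```python
def decode_cudnn8_version(packed):
    """Split a cuDNN 8.x packed version (major*1000 + minor*100 + patch) into a tuple."""
    major, rest = divmod(packed, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


for packed in (8904, 8902):
    dotted = ".".join(str(part) for part in decode_cudnn8_version(packed))
    print(packed, "->", dotted)
# 8904 -> 8.9.4
# 8902 -> 8.9.2
```

So the locally installed library was cuDNN 8.9.4, while the one loaded after removing it is 8.9.2.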


By the way, my complete installation steps were:
→ I found that my GPU is a GTX 1080 Ti
→ installed NVIDIA driver 535.104.05

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+

→ installed CUDA toolkit 12.1

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

→ installed cuDNN 8.9.4 and copied everything from cudnn9804/lib64 and cudnn9804/include into /usr/local/cuda-12.1

→ Finally, in the virtual env, I installed PyTorch using pip 23.3 with:

pip install torch torchvision torchaudio

Then I ran my code, which produced the error:

Could not load libcudnn_cnn_train.so.8 /usr/local/cuda-12.1/lib64/libcudnn_cnn_train.so.8

Since the PyTorch binaries ship CUDA and cuDNN as dependencies, can I simplify my installation? For example:
→ I found that my GPU is an NVIDIA GTX 1080 Ti
→ check which CUDA version supports my GPU
→ pip install torch (the build for a CUDA version that supports my GPU)
→ run my code

Would this installation be correct? If so, great! I would no longer have to install the NVIDIA driver, CUDA toolkit, and cuDNN myself! (In my case, installing CUDA and cuDNN locally was even harmful: it prevented my code from running, not to mention the time I wasted installing them.)

Looking forward to your reply!

You only need to install an NVIDIA driver, not the full CUDA toolkit, to execute PyTorch binaries. Your locally installed CUDA toolkit would be used if you build PyTorch from source or a custom CUDA extension.
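
To see what the installed driver reports, the nvidia-smi header quoted earlier in the thread can be parsed as sketched below. The "CUDA Version" field there is the highest CUDA runtime version the driver supports, not a locally installed toolkit. The helper name `parse_nvidia_smi_header` is made up for illustration, and the regular expressions assume the header layout shown in this thread.

```python
import re


def parse_nvidia_smi_header(text):
    """Extract the driver version and the 'CUDA Version' field from nvidia-smi output."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    return (
        driver.group(1) if driver else None,
        cuda.group(1) if cuda else None,
    )


# Header line as posted earlier in this thread.
header = "| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |"
print(parse_nvidia_smi_header(header))
# ('535.104.05', '12.2')
```

Here the driver reports support up to CUDA 12.2, which covers the CUDA 12.1 runtime bundled with the 2.1.0+cu121 binaries.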

Now I understand how PyTorch installation works, thank you!

Great!!! You really helped me out of a big trouble!

Hello! I really appreciate your work! You saved me a lot of time :)

No, it’s not possible.

You mean I don’t need to install CUDA to run torch 2.2 now?

PyTorch binaries always shipped with the required CUDA dependencies and thus there was never a need to install a CUDA toolkit locally unless you want to build PyTorch from source or a custom CUDA extension.

But why did I never run into missing .so issues before torch 2.1.x?

I only got this after upgrading to the latest version.

Did you read through the thread, e.g. this post?

@ptrblck
The new PyTorch 2.2 release includes the CUDA 11.8 toolkit in my Anaconda env,
so I have to delete my local CUDA environment, such as /usr/local/cuda11.8.
Otherwise I get this error:
Could not load library libcudnn_cnn_train.so.8. Error: /usr/local/cuda-11.8/lib64/libcudnn_cnn_train.so.8: undefined symbol: _ZN10cask_cudnn20ScalarTypeProperties8fromNameEPKc, version libcudnn_cnn_infer.so.8

But if I delete the CUDA environment, I cannot use the nerfacc library:
: fatal error: cuda_runtime.h: No such file or directory
compilation terminated.
ninja: build stopped: subcommand failed.
How should I set up the CUDA environment?

As a workaround, remove the path to your locally installed CUDA toolkit (including cuDNN) from LD_LIBRARY_PATH only when you execute your workload, or remove the local cuDNN directly if it’s not needed. Deleting the CUDA toolkit is not needed.
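
One way to apply this workaround without editing ~/.bashrc is to launch the workload in a child process with a filtered LD_LIBRARY_PATH. The dynamic loader reads the variable when a process starts, so changing it inside an already running Python session would come too late. The sketch below is a rough illustration, not an official recipe: the helper names are made up, and the default prefix "/usr/local/cuda" is an assumption based on the paths in this thread.

```python
import os
import subprocess


def strip_library_path(value, unwanted_prefixes):
    """Drop LD_LIBRARY_PATH entries that start with any of the given prefixes."""
    kept = [
        entry
        for entry in value.split(os.pathsep)
        if entry and not any(entry.startswith(p) for p in unwanted_prefixes)
    ]
    return os.pathsep.join(kept)


def run_without_local_cuda(command, unwanted_prefixes=("/usr/local/cuda",)):
    """Run `command` in a child process whose LD_LIBRARY_PATH omits the local toolkit."""
    env = dict(os.environ)
    cleaned = strip_library_path(env.get("LD_LIBRARY_PATH", ""), unwanted_prefixes)
    if cleaned:
        env["LD_LIBRARY_PATH"] = cleaned
    else:
        env.pop("LD_LIBRARY_PATH", None)  # nothing left, so unset it entirely
    # PATH is left untouched, so nvcc stays visible to extension builds.
    return subprocess.run(command, env=env, check=False)
```

A call such as `run_without_local_cuda(["python", "train.py"])` (with `train.py` standing in for your own script) would then run the training job against the libraries bundled with PyTorch while the toolkit stays installed.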

If cuDNN is installed on Ubuntu and is also bundled with the installed PyTorch, will PyTorch code run using the bundled cuDNN library by default? If so, what mechanism does PyTorch use to achieve this?