CUDA warning while running on a cluster

I have been training my model locally to check that the code is implemented correctly, and now I am moving to the university cluster. Currently, the cluster has the following CUDA setup:

$ nvidia-smi
Mon May 13 16:11:53 2024       
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:2F:00.0 Off |                    0 |
| N/A   41C    P0              62W / 400W |  13267MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|    0   N/A  N/A     10017      G   /usr/bin/X                                   23MiB |
|    0   N/A  N/A     85665      C   python                                    13220MiB |

and I have the following PyTorch environment:

$ conda list
nvidia-cublas-cu12                 pypi_0    pypi
nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
nvidia-cudnn-cu12                 pypi_0    pypi
nvidia-cufft-cu12                pypi_0    pypi
nvidia-curand-cu12               pypi_0    pypi
nvidia-cusolver-cu12               pypi_0    pypi
nvidia-cusparse-cu12               pypi_0    pypi
nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
opencv-python                    pypi_0    pypi
openssl                   3.0.13               h7f8727e_1    main
tensorboard               2.16.2                   pypi_0    pypi
tensorboard-data-server   0.7.2                    pypi_0    pypi
threadpoolctl             3.5.0                    pypi_0    pypi
torch                     2.3.0                    pypi_0    pypi

which I installed via `conda install -c pytorch-nightly pytorch torchvision`.
However, when I launch the script, I get the following warning:

/xxx/xxx/xxx/xxx/.conda/envs/DeepLearning/lib/python3.9/site-packages/torch/autograd/ UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)

I am not sure whether this is a serious issue, as the code seems to run. Is this something to worry about, or could it be slowing down the training process?

This is the OS of my server:

$ cat /etc/os-release 
NAME="CentOS Linux"
VERSION="7 (Core)"
ID_LIKE="rhel fedora"
PRETTY_NAME="CentOS Linux 7 (Core)"


I should mention that I cannot control the CUDA version, nor do I have sudo rights.

Thanks in advance for any help or support.

Check whether your machine has a locally installed CUDA toolkit and cuDNN. If so, remove it from LD_LIBRARY_PATH temporarily as a workaround, since PyTorch ships with its own CUDA and cuDNN runtime dependencies.
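A minimal sketch of that workaround, assuming the offending entries contain the string `cuda` (the example path below is hypothetical; substitute whatever your cluster actually puts on the path):

```shell
# Example only: a hypothetical LD_LIBRARY_PATH with one system CUDA entry.
LD_LIBRARY_PATH="/usr/local/cuda-11.0/lib64:/opt/site/lib"

# Split on ':', drop every entry mentioning "cuda", and rejoin the rest.
CLEANED=$(printf '%s' "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -v 'cuda' | paste -s -d: -)

echo "$CLEANED"   # -> /opt/site/lib
export LD_LIBRARY_PATH="$CLEANED"
```

Doing this in the job script (rather than your shell profile) keeps the change scoped to a single run, which is safer on a shared cluster.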

For some reason, I cannot find the CUDA toolkit there:

$ echo $LD_LIBRARY_PATH
:/xxx/xxx/xxx/xxx/xxx/lib:/scicore/home/boeluc00/canomu0000/JetRawDPCore/lib

$ which cuda
(no output)
$ which nvidia-smi
$ ls /usr/local/cuda
ls: cannot access '/usr/local/cuda': No such file or directory

Is there any other way I can find the location of CUDA? Maybe directly from Python?

You could use `LD_DEBUG=libs python <script> <args>` to check which cuDNN is loaded. If it comes from a system path (not from the Python env you are using), you could drop that path from LD_LIBRARY_PATH, or check why it is found first.
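To make that concrete, here is a sketch (`train.py` and `$PID` are placeholders for your script and the python PID from nvidia-smi; the second command is Linux-only since it reads `/proc`):

```shell
# 1) Loader-level view: LD_DEBUG=libs makes the dynamic linker report every
#    library it resolves at startup; keep only the cuDNN-related lines.
LD_DEBUG=libs python train.py 2>&1 | grep -i cudnn

# 2) From outside: inspect an already-running process and list which
#    libcudnn files are actually mapped into its address space.
grep cudnn "/proc/$PID/maps" | awk '{print $6}' | sort -u
```

If the printed path points into your conda env (e.g. under `site-packages/nvidia/cudnn/lib`), PyTorch's bundled cuDNN is being used; a path outside the env suggests a system copy is shadowing it.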