CUDA warning while running on cluster

phisanti · May 13, 2024, 2:21pm

I have been training my model locally to check that the code is implemented correctly, and I am now moving to the university cluster. The cluster currently has the following CUDA setup:

$ nvidia-smi
Mon May 13 16:11:53 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:2F:00.0 Off |                    0 |
| N/A   41C    P0              62W / 400W |  13267MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     10017      G   /usr/bin/X                                   23MiB |
|    0   N/A  N/A     85665      C   python                                    13220MiB |
+---------------------------------------------------------------------------------------+

and I have the following PyTorch environment:

$conda env
nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
nvidia-cudnn-cu12         8.9.2.26                 pypi_0    pypi
nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
opencv-python             4.9.0.80                 pypi_0    pypi
openssl                   3.0.13               h7f8727e_1    main
...
tensorboard               2.16.2                   pypi_0    pypi
tensorboard-data-server   0.7.2                    pypi_0    pypi
threadpoolctl             3.5.0                    pypi_0    pypi
...
torch                     2.3.0                    pypi_0    pypi

which was installed via conda install -c pytorch-nightly pytorch torchvision
However, when I launch the script, I get the following warning:


/xxx/xxx/xxx/xxx/.conda/envs/DeepLearning/lib/python3.9/site-packages/torch/autograd/graph.py:744: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)

The code seems to run, so I am not sure whether this is a serious issue. Is it something I need to fix, and could it be slowing down training?

This is the OS of my server:


$ cat /etc/os-release 
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

I should mention that I cannot control the CUDA version, nor do I have sudo rights.

Thanks in advance for any help or support.

ptrblck · May 13, 2024, 5:12pm

Check if your machine has a locally installed CUDA Toolkit and cuDNN. If so, temporarily remove them from LD_LIBRARY_PATH as a workaround, since PyTorch ships with its own CUDA and cuDNN runtime dependencies.
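
For anyone who wants to script that check, here is a minimal sketch. It only assumes that the interesting libraries have names starting with libcudnn, libcudart or libcublas. The helper names scan_ld_library_path and without_dirs are made up for this example and are not part of PyTorch.

```python
import os
import re

# Assumption: these name prefixes cover the libraries that could shadow the
# ones shipped with PyTorch. Extend the pattern if your cluster differs.
_CUDA_LIB = re.compile(r"^lib(cudnn|cudart|cublas)\w*\.so(\.\d+)*$")


def scan_ld_library_path(value):
    """Map each directory of an LD_LIBRARY_PATH-style string to the
    CUDA/cuDNN shared objects found directly inside it."""
    hits = {}
    for entry in value.split(os.pathsep):
        if not entry or not os.path.isdir(entry):
            continue  # skip empty entries (e.g. a leading ':') and missing dirs
        try:
            names = os.listdir(entry)
        except OSError:
            continue  # unreadable directory
        libs = sorted(name for name in names if _CUDA_LIB.match(name))
        if libs:
            hits[entry] = libs
    return hits


def without_dirs(value, dirs):
    """Return the path string with the given directories (and empty entries) removed."""
    drop = {os.path.normpath(d) for d in dirs}
    kept = [e for e in value.split(os.pathsep)
            if e and os.path.normpath(e) not in drop]
    return os.pathsep.join(kept)


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    found = scan_ld_library_path(current)
    for directory, libs in found.items():
        print(directory, libs)
    # Suggested value for a temporary test run; nothing is modified here.
    print("export LD_LIBRARY_PATH=" + without_dirs(current, found))
```

If the scan prints nothing, no directory on LD_LIBRARY_PATH contains a matching library and the variable is probably not the cause.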

phisanti · May 14, 2024, 12:47pm

For some reason, I cannot find the CUDA toolkit there:

$ echo $LD_LIBRARY_PATH:/xxx/xxx/xxx/xxx/xxx/lib:/scicore/home/boeluc00/canomu0000/JetRawDPCore/lib

$ which cuda
(nothing comes here)
$ which nvidia-smi
/usr/bin/nvidia-smi
$ ls /usr/local/cuda
ls: cannot access '/usr/local/cuda': No such file or directory

Is there any other way to find where CUDA is installed or exported? Maybe directly from Python?

ptrblck · May 14, 2024, 5:40pm

You could use LD_DEBUG=libs python script.py args to check which cuDNN is loaded. If it’s from a system path (not the Python env you are using) you could drop it from the LD_LIBRARY_PATH or check why it’s found.
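
If you prefer to check from inside Python, a rough alternative on Linux is to read /proc/self/maps, which lists every file mapped into the running process. The sketch below is only an illustration: loaded_libraries is a hypothetical helper, not a PyTorch function. Ideally, call it after a convolution has run on the GPU, since some cuDNN sub-libraries may only be loaded on first use.

```python
import os


def loaded_libraries(keyword, maps_file="/proc/self/maps"):
    """Return sorted paths of files mapped into this process whose
    file name contains `keyword` (Linux only)."""
    paths = set()
    try:
        with open(maps_file) as fh:
            for line in fh:
                # Line format: address perms offset dev inode [pathname]
                fields = line.split(None, 5)
                if len(fields) == 6:
                    path = fields[5].strip()
                    if keyword in os.path.basename(path):
                        paths.add(path)
    except OSError:
        pass  # not Linux, or /proc is unavailable
    return sorted(paths)


if __name__ == "__main__":
    try:
        import torch
        # Versions PyTorch itself reports for its CUDA and cuDNN builds.
        print("torch CUDA:", torch.version.cuda,
              "cuDNN:", torch.backends.cudnn.version())
    except ImportError:
        print("torch is not installed in this environment")
    for path in loaded_libraries("cudnn"):
        print(path)
```

Paths inside your conda environment's site-packages mean the bundled cuDNN is in use. A path elsewhere on the system means another copy is being picked up.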