After extensive research and consulting ChatGPT (which, by the way, was not particularly helpful), I still cannot resolve this problem, so here I am.
I am working on an HPC cluster whose environment is configured in Singularity. I can use the GPU normally when I request an RTX 8000, but when I request an A100, torch can no longer find the GPU. Online posts basically tell me to install a torch build with the correct CUDA version, but my concern is that, since the environment lives in Singularity, I would like it to be agnostic to the GPU architecture. A straightforward way to make it work on the A100 is to reinstall torch or create a new conda env for the new GPU, but I don't think that solution is elegant (it defeats the point of using Singularity, after all). The driver versions for the RTX 8000 and the A100 are the same, so I am guessing that using the same environment on two different architectures may be possible.
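One thing that may be worth ruling out first: a torch wheel can only run kernels on architectures whose `sm_*` binaries (or PTX) it ships, and the A100 is compute capability `sm_80` while the RTX 8000 is `sm_75`. A small diagnostic sketch (the helper name `arch_report` is mine, and it is guarded so it also runs on machines without torch or without a GPU):

```python
# Compare the GPU's compute capability against the architectures this
# torch build was compiled for. If the device's sm_XY is missing from
# the build's arch list, the same wheel cannot run on that GPU even
# though it works on an older one.
try:
    import torch
except ImportError:
    torch = None

def arch_report():
    if torch is None or not torch.cuda.is_available():
        return "torch/CUDA not available here"
    major, minor = torch.cuda.get_device_capability(0)  # e.g. (8, 0) on A100
    built = torch.cuda.get_arch_list()                  # e.g. ['sm_37', ..., 'sm_80']
    return f"device is sm_{major}{minor}; build supports {built}"

print(arch_report())
```

If `sm_80` shows up in the arch list on the RTX 8000 node, the wheel itself should be A100-capable and the problem is more likely in how the container resolves the driver libraries.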
Some attempts I have tried:
- Using ldd to find where the dynamic library dependencies break.
To my surprise, the dependencies of libtorch.so resolve primarily within the Python site-packages; I thought they would be linked against the system CUDA .so files. I added the machine's CUDA path to LD_LIBRARY_PATH before starting the Singularity container. libcublas.so is now linked to the .so file in /usr/local/cuda-11.6/lib64/, but it still doesn't work. (By the way, on the RTX 8000 it seems that only libcublas is linked to the /usr/local path.)
print(torch.cuda.is_available())
print(torch.backends.cudnn.enabled)
torch.rand(1, device="cuda")
False
True
RuntimeError: GPU not found (paraphrased)
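Since `torch.cuda.is_available()` can return False for several unrelated reasons, it may also help to query the NVIDIA driver API directly, bypassing torch entirely. This is only a diagnostic sketch under the assumption that `libcuda.so.1` (normally bind-mounted into the container by Singularity's `--nv` flag) is on the loader path; the helper name `driver_device_count` is mine:

```python
import ctypes

def driver_device_count():
    """Ask the NVIDIA driver API (libcuda, not the CUDA runtime) how many
    devices it can see. Returns None if the driver library itself cannot
    be loaded or initialised, which points at the container / bind-mount
    setup rather than at torch."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None  # driver library not visible inside the container
    if libcuda.cuInit(0) != 0:
        return None  # driver library loaded but failed to initialise
    count = ctypes.c_int(0)
    if libcuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return None
    return count.value

print(driver_device_count())
```

If this prints a device count on the A100 node while torch still fails, the driver is reaching the container and the problem is torch-side library resolution; if it prints None, the driver libraries are not being exposed to the container correctly.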
- Output of python -m torch.utils.collect_env:
====
PyTorch version: 1.12.0+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31
Python version: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.18.0-305.28.1.el8_4.x86_64-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: 11.6.124
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.22.4
[pip3] torch==1.12.0+cu116
[pip3] torchaudio==0.12.0+cu116
[pip3] torchmetrics==0.9.3
[pip3] torchvision==0.13.0+cu116
[conda] numpy 1.22.4 pypi_0 pypi
[conda] torch 1.12.0+cu116 pypi_0 pypi
[conda] torchaudio 0.12.0+cu116 pypi_0 pypi
[conda] torchmetrics 0.9.3 pypi_0 pypi
[conda] torchvision 0.13.0+cu116 pypi_0 pypi
====
I am so confused by this situation that I have resorted to using the RTX 8000 for now. I would be really grateful if anyone could offer some insight into how to make this work, or into how torch's CUDA support works under the hood (e.g. how torch finds the driver code for the GPU; I assume it is dynamic library linkage, but I am not very familiar with this). If I have missed any important information, please let me know so I can supply it.
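For reference, regarding the "how does torch find the driver" question: on Linux the loader resolves torch's CUDA dependencies at import/first-use time, so you can see exactly which files it picked by inspecting the process's memory maps. A Linux-only sketch (the helper name `loaded_cuda_libs` is mine) to run inside the container right after `import torch` and a first CUDA call:

```python
def loaded_cuda_libs():
    """List the CUDA/cuDNN shared objects currently mapped into this
    process, by reading /proc/self/maps (Linux only). Comparing this
    output between the RTX 8000 and A100 nodes shows which library
    files the loader actually resolved in each case."""
    libs = set()
    with open("/proc/self/maps") as maps:
        for line in maps:
            path = line.split()[-1]
            if any(k in path for k in ("libcuda", "libcudart", "libcublas", "libcudnn")):
                libs.add(path)
    return sorted(libs)

for lib in loaded_cuda_libs():
    print(lib)
```

Diffing this list between the two GPU nodes would show whether LD_LIBRARY_PATH changes are actually redirecting the loader to different library files.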