Cannot find GPU on A100 in Singularity

After extensive research and consulting ChatGPT (which, by the way, I have not found particularly helpful), I still cannot resolve this problem, so here I am.

I am working on an HPC cluster where the environment is configured in Singularity. I can use the GPU normally when I request an RTX8000, but when I request an A100, torch can no longer find the GPU. Online posts mostly tell me to install a torch build with the correct CUDA version, but my concern is that since the environment lives in Singularity, I would like it to be agnostic to the GPU architecture. A straightforward way to make it work on the A100 would be to reinstall torch or create a new conda env for the new GPU, but I don't think that solution is elegant (it defeats the point of using Singularity, after all). The driver versions on the RTX8000 and A100 nodes are the same, so I am guessing that using the same environment on two different architectures should be possible.

Things I have tried:

  1. Using ldd to find where the dynamic library dependencies break
    To my surprise, the dependencies of libtorch.so resolve primarily into the Python site-packages; I had thought they would be linked against the system CUDA .so files. I added the machine's CUDA path to LD_LIBRARY_PATH before starting the Singularity container. libcublas.so now resolves to the .so file in /usr/local/cuda-11.6/lib64/, but it still doesn't work. (By the way, on the RTX8000 node it seems that only libcublas is linked from the /usr/local path.) A sketch of this check is included after the output below.

import torch

print(torch.cuda.is_available())
print(torch.backends.cudnn.enabled)
torch.rand(1, device="cuda")

False
True
Runtime Error: GPU not found (paraphrased)
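
For reference, here is a minimal sketch of that linkage check done from Python rather than the shell. It lists the shared libraries bundled inside the installed torch wheel and runs ldd on libtorch_cuda.so; the torch/lib layout and the libtorch_cuda.so name are assumptions based on how the cu116 pip wheels are packaged and may differ between builds.

import os
import subprocess

import torch

# The pip wheels bundle their own CUDA-dependent shared objects under torch/lib.
torch_lib = os.path.join(os.path.dirname(torch.__file__), "lib")
print("Bundled libraries:", sorted(os.listdir(torch_lib)))

# Run ldd on the CUDA-specific library to see where each dependency
# actually resolves (site-packages vs. /usr/local/cuda-*).
target = os.path.join(torch_lib, "libtorch_cuda.so")
if os.path.exists(target):
    print(subprocess.run(["ldd", target], capture_output=True, text=True).stdout)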

  2. Running python -m torch.utils.collect_env, which gives:

====

PyTorch version: 1.12.0+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.18.0-305.28.1.el8_4.x86_64-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: 11.6.124
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.4
[pip3] torch==1.12.0+cu116
[pip3] torchaudio==0.12.0+cu116
[pip3] torchmetrics==0.9.3
[pip3] torchvision==0.13.0+cu116
[conda] numpy 1.22.4 pypi_0 pypi
[conda] torch 1.12.0+cu116 pypi_0 pypi
[conda] torchaudio 0.12.0+cu116 pypi_0 pypi
[conda] torchmetrics 0.9.3 pypi_0 pypi
[conda] torchvision 0.13.0+cu116 pypi_0 pypi

====

I am quite confused by the situation and have resorted to using the RTX8000 for now. I would be really grateful if anyone could give me some insight into how to make this work, or into how torch's CUDA support works under the hood (e.g. how torch finds the driver code for the GPU; I assume it is dynamic library linkage, but I am not very familiar with this). If I have missed any important information for this issue, please let me know and I will supplement it.

No, that’s not the case. The PyTorch binaries shipping with CUDA 11.8 support compute capabilities 3.7 - 9.0 (the binaries with CUDA 11.7 do not support sm_90 [Hopper architecture]).
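
If you want to verify what your installed binary supports, a quick sketch (note: torch.cuda.get_arch_list() may return an empty list when CUDA fails to initialize, so run it on a working node):

import torch

# Compute capabilities the installed binary was compiled for,
# e.g. ['sm_37', ..., 'sm_80', ...]; the A100 is sm_80.
print(torch.cuda.get_arch_list())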

This is expected since the pip wheels and conda binaries ship with their own CUDA dependencies. The CUDA runtime, cuDNN, NCCL, and other libraries will be installed in the specified versions. Your locally installed CUDA toolkit is used to build custom CUDA extensions or PyTorch from source.
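
A small sketch to see this split for yourself (CUDA_HOME here points at the local toolkit, which would only be used for building extensions, not by the wheel at runtime):

import torch
from torch.utils.cpp_extension import CUDA_HOME

# Version of the CUDA libraries bundled inside the wheel:
print(torch.version.cuda)              # e.g. '11.6'
# cuDNN shipped with the wheel, independent of any /usr/lib cuDNN:
print(torch.backends.cudnn.version())
# Local toolkit, used only to compile custom CUDA extensions or torch itself:
print(CUDA_HOME)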

This sounds wrong since you are now trying to mix your system libraries into a pip wheel or conda binaries, which might have been built with another CUDA toolkit.

Since print(torch.cuda.is_available()) returns False, this points to a CUDA init error, which might be caused by a broken driver installation, or you might just be masking the GPU via CUDA_VISIBLE_DEVICES, which was recently the case for another user running into the same issue.
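
A sketch to separate those two causes: check the masking variable, then ask the driver directly via ctypes, bypassing PyTorch entirely (cuInit and cuDeviceGetCount are the actual CUDA driver API entry points; the error handling here is deliberately minimal):

import ctypes
import os

# An empty or invalid value here hides all GPUs from CUDA.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))

# If cuInit fails, the problem is in the driver/container setup,
# not in the PyTorch installation.
cuda = ctypes.CDLL("libcuda.so.1")
status = cuda.cuInit(0)
print("cuInit returned", status)  # 0 == CUDA_SUCCESS
if status == 0:
    count = ctypes.c_int()
    cuda.cuDeviceGetCount(ctypes.byref(count))
    print("visible devices:", count.value)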

Hi ptrblck,

The problem has been solved (probably by the HPC administrator). Thanks for your response!

Best,

Good to hear it was solved! Do you know what the issue was and what exactly fixed it?

Unfortunately, I am not sure what solved the problem. I ran collect_env again, and the only difference is that Is CUDA available is now True. It's rather strange.