Hi folks, I don’t have sudo access, and contacting the sysadmin takes a non-trivial amount of time.
I have access to two remote clusters, and PyTorch reports that CUDA is not available on either of them.
Cluster 1:
Output of a script that prints the PyTorch version, the CUDA version (or "CUDA not available" if CUDA cannot be used), the OS, and the Python version:
PyTorch version: 2.2.2+cu118
/home/user_name/anaconda3/envs/llm2/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
CUDA Available: False
CUDA not available
System OS: Linux 4.18.0-514.el8.x86_64
Python version: 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0]
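The script itself is not shown in this post. A minimal sketch that would print a similar report might look like the following; the function names `collect_env_info` and `print_report` are made up for illustration, and the exact output format is an assumption based on the lines above.

```python
import platform
import sys


def collect_env_info():
    """Gather the PyTorch, CUDA, OS, and Python details as a dict."""
    info = {
        "torch": None,            # stays None if PyTorch is not installed
        "torch_cuda": None,       # CUDA version the PyTorch build targets
        "cuda_available": False,
        "os": f"{platform.system()} {platform.release()}",
        "python": sys.version.replace("\n", " "),
    }
    try:
        import torch
    except ImportError:
        return info
    info["torch"] = torch.__version__
    info["torch_cuda"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = bool(torch.cuda.is_available())
    return info


def print_report(info):
    """Print the collected details in the same order as the output above."""
    print(f"PyTorch version: {info['torch']}")
    print(f"CUDA Available: {info['cuda_available']}")
    if info["cuda_available"]:
        print(f"CUDA version: {info['torch_cuda']}")
    else:
        print("CUDA not available")
    print(f"System OS: {info['os']}")
    print(f"Python version: {info['python']}")


if __name__ == "__main__":
    print_report(collect_env_info())
```

Note that `torch.version.cuda` is the CUDA version the PyTorch binary was built against, which can differ from what `nvcc -V` or `nvidia-smi` report.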
nvcc -V returns:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0
nvidia-smi
> +-----------------------------------------------------------------------------------------+
> | NVIDIA-SMI 550.67 Driver Version: 550.67 CUDA Version: 12.4 |
> |-----------------------------------------+------------------------+----------------------+
> | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
> | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
> | | | MIG M. |
> |=========================================+========================+======================|
> | 0 NVIDIA RTX A6000 Off | 00000000:1C:00.0 Off | Off |
> | 30% 33C P8 19W / 300W | 23MiB / 49140MiB | 0% Default |
> | | | N/A |
> +-----------------------------------------+------------------------+----------------------+
> | 1 NVIDIA RTX A6000 Off | 00000000:1E:00.0 Off | Off |
> | 30% 33C P8 20W / 300W | 11MiB / 49140MiB | 0% Default |
> | | | N/A |
> +-----------------------------------------+------------------------+----------------------+
> | 2 NVIDIA RTX A6000 Off | 00000000:3D:00.0 Off | Off |
> | 30% 32C P8 27W / 300W | 11MiB / 49140MiB | 0% Default |
> | | | N/A |
> +-----------------------------------------+------------------------+----------------------+
> | 3 NVIDIA RTX A6000 Off | 00000000:3E:00.0 Off | Off |
> | 30% 35C P8 25W / 300W | 11MiB / 49140MiB | 0% Default |
> | | | N/A |
> +-----------------------------------------+------------------------+----------------------+
> | 4 NVIDIA RTX A6000 Off | 00000000:3F:00.0 Off | Off* |
> |ERR! 49C P5 ERR! / 300W | 11MiB / 49140MiB | 0% Default |
> | | | N/A |
> +-----------------------------------------+------------------------+----------------------+
> | 5 NVIDIA RTX A6000 Off | 00000000:40:00.0 Off | Off |
> | 30% 32C P8 8W / 300W | 11MiB / 49140MiB | 0% Default |
> | | | N/A |
> +-----------------------------------------+------------------------+----------------------+
> | 6 NVIDIA RTX A6000 Off | 00000000:41:00.0 Off | Off |
> | 30% 31C P8 16W / 300W | 11MiB / 49140MiB | 0% Default |
> | | | N/A |
> +-----------------------------------------+------------------------+----------------------+
> | 7 NVIDIA RTX A6000 Off | 00000000:5E:00.0 Off | Off |
> | 30% 29C P8 7W / 300W | 11MiB / 49140MiB | 0% Default |
> | | | N/A |
> +-----------------------------------------+------------------------+----------------------+
>
> +-----------------------------------------------------------------------------------------+
> | Processes: |
> | GPU GI CI PID Type Process name GPU Memory |
> | ID ID Usage |
> |=========================================================================================|
> | 0 N/A N/A 4216 G /usr/libexec/Xorg 9MiB |
> | 0 N/A N/A 4466 G /usr/bin/gnome-shell 4MiB |
> | 1 N/A N/A 4216 G /usr/libexec/Xorg 4MiB |
> | 2 N/A N/A 4216 G /usr/libexec/Xorg 4MiB |
> | 3 N/A N/A 4216 G /usr/libexec/Xorg 4MiB |
> | 4 N/A N/A 4216 G /usr/libexec/Xorg 4MiB |
> | 5 N/A N/A 4216 G /usr/libexec/Xorg 4MiB |
> | 6 N/A N/A 4216 G /usr/libexec/Xorg 4MiB |
> | 7 N/A N/A 4216 G /usr/libexec/Xorg 4MiB |
> +-----------------------------------------------------------------------------------------+
Cluster 2:
Output of the same script, which prints the PyTorch version, the CUDA version (or "CUDA not available" if CUDA cannot be used), the OS, and the Python version:
PyTorch version: 2.2.2
/home/user_name/.conda/envs/llm/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at /opt/conda/conda-bld/pytorch_1711403380909/work/c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
CUDA Available: False
CUDA not available
System OS: Linux 3.10.0-1160.36.2.el7.x86_64
Python version: 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0]
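The number 11040 in the warning above is the highest CUDA version the installed driver supports, encoded as 1000 * major + 10 * minor, so it corresponds to CUDA 11.4. A small sketch of decoding it and doing a conservative comparison against a PyTorch build follows. The helper names are hypothetical, and the "12.1" build version is only an assumed example, since the output above does not say which CUDA version this PyTorch build targets.

```python
def decode_cuda_driver_version(code):
    """Split an encoded CUDA version (1000 * major + 10 * minor) into a tuple."""
    major = code // 1000
    minor = (code % 1000) // 10
    return major, minor


def driver_supports(driver_code, build_cuda):
    """Conservative check: the driver's CUDA version must be >= the build's.

    build_cuda is a string such as "11.8", as reported by torch.version.cuda.
    """
    build = tuple(int(part) for part in build_cuda.split(".")[:2])
    return decode_cuda_driver_version(driver_code) >= build


if __name__ == "__main__":
    print(decode_cuda_driver_version(11040))   # value from the warning above
    print(driver_supports(11040, "12.1"))      # "12.1" is an assumed example
```

This comparison is deliberately strict. NVIDIA's minor-version compatibility may allow a build for a newer minor release to run on an older driver within the same major version, so a failed check here is a hint rather than a final verdict.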
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Mon_Oct_24_19:12:58_PDT_2022
Cuda compilation tools, release 12.0, V12.0.76
Build cuda_12.0.r12.0/compiler.31968024_0
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01 Driver Version: 470.63.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:18:00.0 Off | N/A |
| 31% 32C P8 1W / 250W | 0MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:3B:00.0 Off | N/A |
| 31% 32C P8 9W / 250W | 0MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce ... Off | 00000000:86:00.0 Off | N/A |
| 31% 35C P8 19W / 250W | 0MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce ... Off | 00000000:AF:00.0 Off | N/A |
| 31% 34C P8 1W / 250W | 3MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+