I’m trying to run torch on top of GPUs of a server which i’m root. Drivers seems installed correctly:
> nvcc --version
< nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_11:03:34_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
> nvidia-smi
< Tue Apr 2 18:28:01 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 GH200 480GB On | 00000009:01:00.0 Off | 0 |
| N/A 27C P0 79W / 900W | 4MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
> lscpu
< Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 72
On-line CPU(s) list: 0-71
Vendor ID: ARM
Model name: Neoverse-V2
Model: 0
Thread(s) per core: 1
Core(s) per socket: 72
Socket(s): 1
Stepping: r0p0
Frequency boost: disabled
CPU max MHz: 3510.0000
CPU min MHz: 81.0000
BogoMIPS: 2000.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc
dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp s
ve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti
Caches (sum of all):
L1d: 4.5 MiB (72 instances)
L1i: 4.5 MiB (72 instances)
L2: 72 MiB (72 instances)
L3: 114 MiB (1 instance)
NUMA:
NUMA node(s): 9
NUMA node0 CPU(s): 0-71
NUMA node1 CPU(s):
NUMA node2 CPU(s):
NUMA node3 CPU(s):
NUMA node4 CPU(s):
NUMA node5 CPU(s):
NUMA node6 CPU(s):
NUMA node7 CPU(s):
NUMA node8 CPU(s):
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Not affected
Srbds: Not affected
Tsx async abort: Not affected
but when i run
> import torch
> print(torch.cuda.is_available())
< False
what could be the problem?
Solutions i’ve already tried (and not working):
- reinstall torch
- downgrade cuda drivers from scratch following this link
- I also tried to run the previous python script into a container (nvcr.io/nvidia/pytorch) and seems working! However its not what i want, because i need to work direcly on host filesystem.
Thanks in advance!