GH200: CUDA not available in PyTorch

I’m trying to run torch on the GPU of a server on which I have root access. The drivers seem to be installed correctly:

> nvcc --version
< nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_11:03:34_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
> nvidia-smi
< Tue Apr  2 18:28:01 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GH200 480GB                    On  | 00000009:01:00.0 Off |                    0 |
| N/A   27C    P0              79W / 900W |      4MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
> lscpu
< Architecture:           aarch64
  CPU op-mode(s):       64-bit
  Byte Order:           Little Endian
CPU(s):                 72
  On-line CPU(s) list:  0-71
Vendor ID:              ARM
  Model name:           Neoverse-V2
    Model:              0
    Thread(s) per core: 1
    Core(s) per socket: 72
    Socket(s):          1
    Stepping:           r0p0
    Frequency boost:    disabled
    CPU max MHz:        3510.0000
    CPU min MHz:        81.0000
    BogoMIPS:           2000.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc
                        dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp s
                        ve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti
Caches (sum of all):
  L1d:                  4.5 MiB (72 instances)
  L1i:                  4.5 MiB (72 instances)
  L2:                   72 MiB (72 instances)
  L3:                   114 MiB (1 instance)
NUMA:
  NUMA node(s):         9
  NUMA node0 CPU(s):    0-71
  NUMA node1 CPU(s):
  NUMA node2 CPU(s):
  NUMA node3 CPU(s):
  NUMA node4 CPU(s):
  NUMA node5 CPU(s):
  NUMA node6 CPU(s):
  NUMA node7 CPU(s):
  NUMA node8 CPU(s):
Vulnerabilities:
  Gather data sampling: Not affected
  Itlb multihit:        Not affected
  L1tf:                 Not affected
  Mds:                  Not affected
  Meltdown:             Not affected
  Mmio stale data:      Not affected
  Retbleed:             Not affected
  Spec rstack overflow: Not affected
  Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:           Mitigation; __user pointer sanitization
  Spectre v2:           Not affected
  Srbds:                Not affected
  Tsx async abort:      Not affected

But when I run

> import torch
> print(torch.cuda.is_available())
< False

What could be the problem?

Solutions I’ve already tried (without success):

  1. Reinstalling torch.
  2. Downgrading the CUDA drivers from scratch, following this link.
  3. Running the same Python script in a container (nvcr.io/nvidia/pytorch). That does seem to work! However, it’s not what I want, because I need to work directly on the host filesystem.

Thanks in advance! :slight_smile:

PyTorch binaries do not support ARM + CUDA yet, but we are working on it :slight_smile:
In the meantime, please use the NGC containers.
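
To confirm that this is what is happening on the host, it can help to check how the installed wheel was built rather than only whether CUDA is available. A minimal sketch, assuming nothing beyond a standard torch install (the helper name describe_torch_build is made up for illustration); torch.version.cuda is None when the wheel was compiled without CUDA support:

```python
import platform


def describe_torch_build():
    """Collect the facts that distinguish a CPU-only wheel from a CUDA build."""
    info = {
        "machine": platform.machine(),  # e.g. "aarch64" on a GH200 host
        "torch": None,
        "built_with_cuda": None,
        "cuda_available": None,
    }
    try:
        import torch
    except ImportError:
        return info  # torch is not installed in this environment
    info["torch"] = torch.__version__
    # None here means the wheel itself has no CUDA support,
    # regardless of what nvidia-smi or nvcc report.
    info["built_with_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

If built_with_cuda prints None on the host but a version string inside the NGC container, the driver install is fine and the wheel is the limiting factor.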

When I got it running on NGC, PyTorch can only see the HBM and not the rest of the memory space (LPDDR5). Do you know how to expose or use the full memory? Nvidia keeps saying it’s fully coherent and should show up as one unified memory, but it most certainly does not.

Refreshing this thread. I am a novice with the lower-level APIs and have the same question as @az226: when running PyTorch in the container, it can only access the ~96 GB of VRAM and throws an OOM error when exceeding that.

I found this GitHub thread: Allow oversubscription of GPU memory through cudaMallocManaged on cuBLAS builds for systems like GH200 · Issue #5026 · ggerganov/llama.cpp, which suggests using pageableMemoryAccess. I assume this could be handled somewhere in the NVIDIA container and just hasn’t been released yet?
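
For anyone who wants to see whether the device reports that attribute at all, one option is to query the CUDA runtime directly from Python. This is only a rough sketch under a few assumptions: libcudart must be discoverable by the system loader, the helper name is made up, and the enum value 88 for cudaDevAttrPageableMemoryAccess is what I remember from the CUDA runtime headers, so please verify it against driver_types.h for your toolkit:

```python
import ctypes
import ctypes.util

# ASSUMPTION: numeric value of cudaDevAttrPageableMemoryAccess in the
# CUDA runtime's cudaDeviceAttr enum; check driver_types.h to be sure.
CUDA_DEV_ATTR_PAGEABLE_MEMORY_ACCESS = 88


def pageable_memory_access(device=0):
    """Return True/False for the attribute, or None if it cannot be queried."""
    name = ctypes.util.find_library("cudart")
    if name is None:
        return None  # CUDA runtime not found on the loader path
    try:
        lib = ctypes.CDLL(name)
    except OSError:
        return None
    lib.cudaDeviceGetAttribute.restype = ctypes.c_int
    lib.cudaDeviceGetAttribute.argtypes = [
        ctypes.POINTER(ctypes.c_int),  # int *value
        ctypes.c_int,                  # enum cudaDeviceAttr attr
        ctypes.c_int,                  # int device
    ]
    value = ctypes.c_int(0)
    err = lib.cudaDeviceGetAttribute(
        ctypes.byref(value), CUDA_DEV_ATTR_PAGEABLE_MEMORY_ACCESS, device
    )
    if err != 0:  # any non-zero cudaError_t means the query failed
        return None
    return bool(value.value)


if __name__ == "__main__":
    print("pageableMemoryAccess:", pageable_memory_access())
```

This only reports what the device advertises; it does not change how PyTorch allocates memory.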

@ptrblck any thoughts here?

Thanks,
Randy

I would also like to learn the correct way to do this.

You can customize the CUDA memory allocator PyTorch uses, like this:

import torch

new_alloc = torch.cuda.memory.CUDAPluggableAllocator('./alloc.so', 'my_malloc', 'my_free')
torch.cuda.memory.change_current_allocator(new_alloc)

where alloc.so just wraps cudaMallocManaged, following the pluggable allocator example in the PyTorch CUDA docs:

#include <sys/types.h>
#include <cuda_runtime_api.h>

// Compile with: g++ alloc.cc -o alloc.so -I/usr/local/cuda/include -shared -fPIC
extern "C" {
// Called by PyTorch for every CUDA allocation. Returns nullptr on failure.
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
    void *ptr = nullptr;
    // Use cudaMallocManaged so the allocation lives in unified memory.
    if (cudaMallocManaged(&ptr, size) != cudaSuccess) {
        return nullptr;
    }
    return ptr;
}

// Called by PyTorch when a block from my_malloc is released.
void my_free(void* ptr, ssize_t size, int device, cudaStream_t stream) {
    cudaFree(ptr);
}
}

This works in the sense that you can oversubscribe GPU memory and it does run (you can monitor the host-memory portion with htop; nvidia-smi does not show it), but my tests run so slowly that I don’t think it’s using the Grace coherent memory “properly”, somehow.
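
For reference, the Python side can be wrapped in a small guard so it fails quietly when the library has not been built or CUDA is not usable. This is just a sketch: the helper name install_managed_allocator is made up, and it assumes alloc.so exports my_malloc and my_free as in the C++ file above. As far as I understand from the PyTorch docs, the allocator has to be swapped before any CUDA memory is allocated, otherwise change_current_allocator raises an error.

```python
import os


def install_managed_allocator(so_path="./alloc.so"):
    """Switch PyTorch to the cudaMallocManaged wrapper; return True on success."""
    if not os.path.exists(so_path):
        return False  # the shared library has not been compiled yet
    try:
        import torch
    except ImportError:
        return False  # torch is not installed in this environment
    if not torch.cuda.is_available():
        return False  # CPU-only build or no usable GPU
    alloc = torch.cuda.memory.CUDAPluggableAllocator(
        so_path, "my_malloc", "my_free"
    )
    # Must run before the first CUDA tensor is created.
    torch.cuda.memory.change_current_allocator(alloc)
    return True


if __name__ == "__main__":
    print("managed allocator installed:", install_managed_allocator())
```

Call it at the very top of the script, before any model or tensor is moved to the GPU.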

Thanks for this, I’m glad others are making progress beyond what I am able to do! If you find out anything more, please come back and share.

For a while I was using this UVM patch for PyTorch (feat: support CUDA Unified Memory · pytorch/pytorch@3dd29c3), which is mentioned here: Support CUDA Unified Memory by 0x804d8000 · Pull Request #106200 · pytorch/pytorch.

With the patch applied you can set PYTORCH_CUDA_ALLOC_CONF='use_uvm:True'.
Unfortunately, the patch no longer works with newer PyTorch versions (like 2.3.1).

There is a newer thread in the PyTorch issues regarding cudaMallocManaged (pytorch/issues/124296), which refers to “[RFC] Mix and Match CUDA Allocators using Private Pools”.

Thanks @mrctndl for the provided code!

Is there any news on how to use pageableMemoryAccess?

Got the UVM patch working again with torch >2.2.2. I will continue to use it, since PyTorch does not offer me any other easy alternative at the moment.