When attempting a training run for a 7-billion-parameter model, I get this error:
OutOfMemoryError: CUDA out of memory. Tried to allocate 344.00 MiB. GPU 0 has a total capacity of 95.00 GiB of which 238.12 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 92.96 GiB is allocated by PyTorch, and 667.94 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
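For reference, the allocator hint in the message can be tried by setting the config before torch initializes CUDA; a minimal sketch (the 128 MiB split size is an arbitrary example value, not something taken from the error):

```python
import os

# The caching allocator reads PYTORCH_CUDA_ALLOC_CONF once, when CUDA is
# initialized, so it must be set before importing torch (or exported in the
# shell before launching the script). max_split_size_mb caps how large a
# cached block the allocator will split, which can reduce the fragmentation
# the error message hints at.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

Note this only mitigates fragmentation within the device memory the allocator already sees; it does not expose any additional memory to PyTorch.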
When running torch.cuda.get_device_properties(0), torch returns _CudaDeviceProperties(name='GH200 480GB', major=9, minor=0, total_memory=97280MB, multi_processor_count=132). Is the LPDDR5X memory not fully coherent with system memory? If so, is there any way to expose it to PyTorch without a custom allocator?
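For context on the question: tallying the figures from the error against that total_memory=97280MB value suggests the allocator sees only the roughly 96 GB of on-package HBM (my assumption is that the "480GB" in the device name refers to the Grace CPU's LPDDR5X, which torch is not counting), and that HBM is essentially full. Plain arithmetic on the values quoted above:

```python
# All values copied from the OOM message, converted to MiB.
total_mib = 97280                 # total_memory reported by torch
allocated = 92.96 * 1024          # "92.96 GiB is allocated by PyTorch"
reserved_unalloc = 667.94         # "reserved by PyTorch but unallocated"
free = 238.12                     # "238.12 MiB is free"

accounted = allocated + reserved_unalloc + free
print(f"accounted for: {accounted:.0f} of {total_mib} MiB")
# -> accounted for: 96097 of 97280 MiB
# The ~1.2 GiB remainder is CUDA context / driver overhead, and the
# 344 MiB request cannot fit in the ~238 MiB that is actually free.
```

So the failed 344 MiB allocation is consistent with the HBM alone being exhausted, independent of whether the LPDDR5X is reachable.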