Unified memory across NVSwitch'ed GPUs

Hi everyone,

Fortunately, it is becoming easier and easier to access NVSwitch’ed GPUs in commodity cloud environments, often with some pretty attractive spot pricing. My general understanding is that NVSwitch allows uniform memory access to all GPUs from all GPUs, so you can think of the system as “one giant GPU” with a single memory space. CUDA unified memory allows for this via cudaMallocManaged (though the other “purpose” of CUDA unified memory is to seamlessly integrate with the much slower host memory).

My question is this: is there a way to get PyTorch, when running on an NVSwitch system, to treat all of that memory as one big block? I know that it is of course better to do manual data parallelism, and/or to shard your model across GPUs if it is too big, use DeepSpeed, etc., but for ease of use… wouldn’t it be nice if you could just keep allocating more memory across multiple devices? The convenience would more than make up for the performance penalty (NVSwitch bandwidth is still not the same as on-device bandwidth, but it is relatively close). There are a lot of times where it would be nice to port single-GPU code over to a bigger system just to make the model a little bigger than one GPU allows, or to make the batch size a little bigger, without dealing with distributed data parallel.
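For context on the peer-to-peer part, here is a quick sketch of the kind of check I have been running; nothing in it is specific to my setup, and it only shows that each pair of GPUs can address each other, not that the memory behaves as one pool:

```python
import torch

# On an NVSwitch box, every pair of visible GPUs should report peer access.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"cuda:{i} -> cuda:{j} peer access: {ok}")

# A device-to-device copy like this travels over NVLink/NVSwitch when peer
# access is enabled, but each tensor still lives entirely on one device.
a = torch.randn(1024, 1024, device="cuda:0")
b = a.to("cuda:1")
```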

I have tried to recompile PyTorch, replacing cudaMalloc with cudaMallocManaged as in this paper:

But PyTorch still pins the memory to one device (and if it grows beyond that device’s capacity, it spills into CPU memory, not onto the other GPUs).
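In case it helps narrow down what I am doing wrong, here is roughly the route I expected to be equivalent to a recompile, using the pluggable-allocator hook instead. This is just a sketch: it assumes a recent PyTorch that exposes torch.cuda.memory.CUDAPluggableAllocator, and managed_alloc.so with its managed_malloc/managed_free exports is a placeholder name for a small shared library I would compile myself to wrap cudaMallocManaged and cudaFree:

```python
import torch

# Assumption: ./managed_alloc.so is a hand-compiled shared library exporting
# managed_malloc/managed_free, which call cudaMallocManaged and cudaFree.
# The names are placeholders, not anything that ships with PyTorch.
managed_allocator = torch.cuda.memory.CUDAPluggableAllocator(
    "./managed_alloc.so", "managed_malloc", "managed_free"
)

# Has to happen before any CUDA memory is allocated in this process.
torch.cuda.memory.change_current_allocator(managed_allocator)

# Tensors created afterwards come from the managed allocator, but PyTorch
# still attributes them to a single device.
x = torch.zeros(1024, 1024, device="cuda:0")
```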

So anyway, I figured I would ask folks here who have more intimate knowledge of PyTorch’s allocation code whether this might be a quick fix.

Thanks everyone for your time!

No, NVSwitch provides more sophisticated connectivity between the devices, as described here. It doesn’t “automatically” create a single device.

That would be expected, since unified memory is used and oversubscription spills into host memory.

Thank you for the fast reply.
Reading the link you sent (emphasis added):
It supports full all-to-all communication with direct GPU peer-to-peer memory addressing. These 16 GPUs can be used as a single high-performance accelerator with unified memory space and up to 10 petaFLOPS of deep learning compute power.
I guess I’m just wondering if that is possible for DL workloads with PyTorch…? Sounds like no? Thank you!

I don’t think the “single accelerator” wording means it’s usable as a single device, but rather that the system acts as one accelerated unit due to the better interconnectivity.

Yes, you can use NVLink/NVSwitch through DistributedDataParallel or any other parallel approach.
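A minimal sketch of that route, launched with torchrun so that each GPU gets its own process (the model and sizes here are just placeholders):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

    x = torch.randn(32, 4096, device=local_rank)
    for _ in range(10):
        opt.zero_grad()
        loss = ddp_model(x).sum()
        loss.backward()  # gradient all-reduce runs over NCCL (NVLink/NVSwitch)
        opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Run it with e.g. torchrun --nproc_per_node=8 script.py on a single NVSwitch node; NCCL will pick the NVLink/NVSwitch transport automatically.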