Fortunately, it is becoming easier and easier to access NVSwitch'ed GPUs in commodity cloud environments, sometimes at pretty attractive spot pricing. My general understanding is that NVSwitch allows uniform memory access to all GPUs from all GPUs, so you can think of the machine as "one giant GPU" with a contiguous memory space. CUDA unified memory allows for this via cudaMallocManaged (though the other "purpose" of CUDA unified memory is to seamlessly integrate with much slower host memory).

My question is this: is there a way to get PyTorch, when running on an NVSwitch system, to treat all of the GPU memory as one big block? I know that of course it's better to do manual data parallelism, shard your model across GPUs if it's too big, use DeepSpeed, etc., but for ease of use... wouldn't it be nice if you could just keep allocating memory across multiple devices? The ease of use would more than make up for the performance penalty (NVSwitch bandwidth is still not the same as on-device bandwidth, but it is relatively close). There are a lot of times when it would be nice to port single-GPU code over to a bigger system just to make the model a bit bigger than one GPU allows, or to make the batch size a bit bigger, without dealing with distributed data parallel.
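To make the "one big block" idea concrete, here is a minimal sketch of my own (not anything from PyTorch) of what managed memory looks like at the CUDA level: a single allocation that every GPU, and the host, can dereference through the same pointer, with the driver migrating pages on demand. It assumes a Linux box with Pascal-or-newer GPUs:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void touch(float* p, size_t n, float v) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += v;
}

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    // One managed allocation; size is illustrative (4 GiB of floats).
    size_t n = size_t(1) << 30;
    float* buf = nullptr;
    cudaMallocManaged(reinterpret_cast<void**>(&buf), n * sizeof(float));

    // Every GPU can dereference the same pointer; the driver migrates
    // pages to whichever device touches them.
    unsigned int blocks = static_cast<unsigned int>((n + 255) / 256);
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);
        touch<<<blocks, 256>>>(buf, n, 1.0f);
        cudaDeviceSynchronize();
    }

    printf("buf[0] = %f\n", buf[0]);  // also directly readable from the host
    cudaFree(buf);
    return 0;
}
```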
I have tried recompiling PyTorch with cudaMalloc replaced by cudaMallocManaged, as in this paper:
But PyTorch still pins the memory to one device (and if it grows beyond that, it spills into CPU memory, not onto the other GPUs).
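For reference, this is roughly the kind of substitution I mean, as a toy sketch of my own (the real change sits inside PyTorch's caching allocator, c10/cuda/CUDACachingAllocator.cpp, and is more involved than this):

```cpp
#include <cuda_runtime.h>

// What the allocator effectively does today: memory bound to the current device.
cudaError_t raw_alloc_device(void** ptr, size_t size) {
    return cudaMalloc(ptr, size);
}

// The swap I attempted: managed memory, addressable from any GPU (and the host),
// with pages migrated on demand by the driver.
cudaError_t raw_alloc_managed(void** ptr, size_t size) {
    return cudaMallocManaged(ptr, size, cudaMemAttachGlobal);
}
```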
So anyway, I figured I would ask the folks here who have more intimate knowledge of PyTorch's allocation code: is this a quick fix?
Thanks everyone for your time!