PyTorch with CUDA Unified Memory

Hello PyTorch fans,

Has anyone compiled PyTorch with CUDA Unified Memory? Is there any documentation on how this can be done?

Thx,
FM

Hi!

We do not support unified memory in PyTorch.
Unfortunately, there are only very narrow use cases where it brings big improvements in practice.
Is there any specific application you have that requires it?

Hi,

I have an I/O bottleneck and I believe CUDA Unified Memory would help in this case. A cudaMallocManaged call that allocates a single pointer accessible by either the GPUs or the CPUs could most probably be a great help on a multi-GPU system.
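To illustrate what I mean, here is a minimal sketch in plain CUDA C++ (not PyTorch, just what cudaMallocManaged itself provides): one allocation, one pointer, usable from both the CPU and the GPU, with no explicit cudaMemcpy anywhere.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: increments every element in place.
__global__ void inc(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr;

    // One allocation, one pointer, valid on both host and device.
    cudaMallocManaged(&x, n * sizeof(float));

    for (int i = 0; i < n; ++i) x[i] = 1.0f;   // written on the CPU

    inc<<<(n + 255) / 256, 256>>>(x, n);       // read/written on the GPU
    cudaDeviceSynchronize();                   // wait before the CPU touches it again

    printf("x[0] = %f\n", x[0]);               // read back on the CPU, no cudaMemcpy needed
    cudaFree(x);
    return 0;
}
```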

Rgds,
FM

It may simplify your code so that you don’t have to add .cuda() calls. But speed-wise, it is unclear whether you will get any benefit.

Can this be tested, even experimentally? I would appreciate even some build instructions.

Rgds,
FM

Sure, you can check this hackathon entry from this summer: Introducing SpeedTorch: 4x speed CPU->GPU transfer, 110x GPU->CPU transfer
Note that the copy is faster because no copy actually happens here.
Any actual operation on these Tensors will be significantly slower, though, as the copy will happen then, and such slowdowns are not measured in the benchmark of the submission.

Note that the copy is faster because no copy actually happens here.

What does this mean? I’m not super familiar with CUDA, despite being the author of that library lol

The copy appears much faster because, with unified memory, the data is only moved when it is actually needed on one device. So when you just “send” a Tensor to the GPU, it only creates a GPU Tensor that references the CPU memory (no copy happens here). Now if you access some values of that GPU Tensor in a kernel, the data will be copied from the CPU to the GPU for the operation to be performed.
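Roughly, in plain CUDA terms (just a sketch to illustrate the behaviour, not anything PyTorch does internally): the allocation and the “transfer” are essentially free, and the real data movement only happens when a kernel first touches the pages.

```cpp
#include <cuda_runtime.h>

__global__ void touch(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 24;                      // ~64 MB of floats
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;    // pages now live in host memory

    // "Sending it to the GPU" costs nothing: nothing has moved yet.
    // The first kernel that touches x pays for the host->device migration
    // through page faults; the second launch on the same data does not.
    touch<<<(n + 255) / 256, 256>>>(x, n);      // slow: migration happens now
    cudaDeviceSynchronize();
    touch<<<(n + 255) / 256, 256>>>(x, n);      // fast: data is already resident
    cudaDeviceSynchronize();

    // To move the data up front instead of relying on page faults,
    // you would prefetch it explicitly before the first launch:
    //   int dev; cudaGetDevice(&dev);
    //   cudaMemPrefetchAsync(x, n * sizeof(float), dev);

    cudaFree(x);
    return 0;
}
```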
You couldn’t get 110x on a real transfer; the PCIe connection is the limit, and our current transfer speed is already pretty close to it (not sure exactly how close).


Hi @albanD

This is not my case, because I am using an IBM AC922, where the GPUs are connected to the CPUs via NVLink 2.0. In addition, the memory bandwidth to the CPUs is via 8x DDR4 channels (measured at around 220 GB/s).

You can find similar research on this type of HW architecture here:
http://www.ieee-hpec.org/2018/2018program/index_htm_files/135.pdf

Rgds,
FM

Hi,

I’m sorry, I’m not sure I understand what you mean by “this is not my case”?

@albanD
My GPUs are connected to the CPUs with NVLink 2.0 at 150 GB/s. This is in response to your statement “the PCIe connection is the limit, and our current transfer speed is pretty close to it (not sure exactly how close).”

So I assume that CUDA Unified Memory in PyTorch could bring somewhat more benefit on my system architecture than in the case you described.

Rgds,
FM

Yes, but in your diagram above you can see that the on-chip memory gives 900 GB/s.
And since many of the operations we have these days are memory-limited, the slowdown you get can be significant (in the worst case, roughly 900/150 ≈ 6x for a purely bandwidth-bound kernel whose data has to stream over the link).

That being said, it may also depend a lot on your workload.
I would be interested to know what the result is with your workload.
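If you want to check on your machine, here is a rough way to measure it (a plain CUDA sketch; the helper name, sizes, and launch parameters are just for illustration): time the same bandwidth-bound kernel once with data already resident in device memory and once with managed memory that still lives in host RAM.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Bandwidth-bound kernel: one read and one write per element.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Time one launch of the kernel with CUDA events (milliseconds).
static float time_kernel(float *x, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    scale<<<(n + 255) / 256, 256>>>(x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 26;                        // 256 MB of floats
    size_t bytes = n * sizeof(float);

    // Case 1: data already resident in device memory.
    float *d; cudaMalloc(&d, bytes);
    cudaMemset(d, 0, bytes);
    printf("device-resident:             %.2f ms\n", time_kernel(d, n));

    // Case 2: managed memory still sitting in host RAM;
    // the first launch pays for the migration over PCIe / NVLink.
    float *m; cudaMallocManaged(&m, bytes);
    for (int i = 0; i < n; ++i) m[i] = 1.0f;      // first touch on the CPU
    printf("managed, first touch on GPU: %.2f ms\n", time_kernel(m, n));

    cudaFree(d); cudaFree(m);
    return 0;
}
```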

An update on this: I found a paper which implements this approach in TensorFlow.

It uses quite big images for training. Here is the paper in case anyone finds it useful.

Any idea if this could be ported to PyTorch?


I found this paper saying they implemented PyTorch with Unified Memory. Unfortunately, it is behind a paywall, so I can’t find any details about it: Implementing CUDA Unified Memory in the PyTorch Framework | IEEE Conference Publication | IEEE Xplore
