Thoughts on use of CPU ram as a swap for GPU

ReyhaneAskari · February 2, 2018, 5:22pm

I want to use CPU RAM as swap space for GPU memory to allow oversubscription. Is there a way to implement this in CUDA or C++?

As a first step, I replaced every cudaMalloc with cudaMallocManaged and tested it with a simple GEMM on a Pascal GPU. I increase the size of the matrix and check when I get a memory error. Right now both versions crash at the same point, so we don’t really gain anything. Do you have any thoughts on this?
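
For the matrix-size experiment described above, it can help to estimate in advance where each variant should run out of memory. The following is a minimal sketch, not a measurement: it assumes a square single-precision GEMM with three resident n x n matrices, ignores CUDA context and workspace overhead, and the 16 GiB / 64 GiB figures are placeholder assumptions rather than values from this thread.

```python
import math

def gemm_bytes(n: int, dtype_bytes: int = 4) -> int:
    """Bytes needed to hold A, B and C for an n x n GEMM (C = A @ B)."""
    return 3 * n * n * dtype_bytes

def largest_square_gemm(budget_bytes: int, dtype_bytes: int = 4) -> int:
    """Largest n whose three n x n matrices fit in budget_bytes."""
    return math.isqrt(budget_bytes // (3 * dtype_bytes))

GIB = 1024 ** 3
gpu_bytes = 16 * GIB    # assumed device memory, adjust for your GPU
host_bytes = 64 * GIB   # assumed host RAM usable as backing store

# Expected failure point with plain cudaMalloc (device memory only).
n_device_only = largest_square_gemm(gpu_bytes)
# Expected failure point if oversubscription into host RAM works.
n_oversubscribed = largest_square_gemm(gpu_bytes + host_bytes)

print(f"device only:    n <= {n_device_only}")
print(f"oversubscribed: n <= {n_oversubscribed}")
```

If both variants fail near the first value, the managed allocations are probably not being oversubscribed at all; if the managed variant gets close to the second value, paging to host RAM is working.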

kouyoumin · September 5, 2019, 6:46am

Hi, I tried something similar and it passed my simple test.

mtbrandy · September 5, 2019, 5:23pm

Large Model Support implements exactly what you describe:

Wiki: https://github.com/mtbrandy/pytorch/wiki/Large-Model-Support
Code: https://github.com/mtbrandy/pytorch/tree/v1.1.0-LMS

This is the contribution proposal for the LMS feature that is currently available in the Watson Machine Learning solution. For information about that solution, see https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.1/navigation/wmlce_getstarted_pytorch.html

Dale_Song · December 10, 2019, 2:52am

Are there more technical details about the LMS feature?

mtbrandy · January 8, 2020, 8:38pm

You might be interested in the following blog post, which walks through a use case for PyTorch LMS and some performance considerations:

mohotmoz · November 21, 2021, 1:28am

@kouyoumin thank you so much for your post - I am trying to replicate with the most recent version of PyTorch and it looks like the spots where you edited the code are different now - any pointers as to where these changes could be made? (really excited to try this on our new NVSwitch’d A100s to see if the unified memory can truly be “one giant GPU” as NVIDIA claims with a contiguous address space…)

kouyoumin · November 29, 2021, 8:41am

@mohotmoz My idea was simple: replace cudaMalloc with cudaMallocManaged. However, there’s still a check that prevents a CUDA op from accessing tensors on another CUDA device, so I modified the logic to allow that case.
I have updated the code for recent PyTorch changes.
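
To illustrate the kind of relaxed device check described above, here is a toy model in plain Python. It is not PyTorch code and does not reflect where or how the real check is implemented: `ToyTensor`, `check_same_device`, and the `allow_managed` flag are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass(frozen=True)
class ToyTensor:
    """Hypothetical stand-in for a tensor; not a PyTorch type."""
    device_index: int      # CUDA device the tensor is tagged with
    managed: bool = False  # True if backed by cudaMallocManaged

def check_same_device(tensors: Iterable[ToyTensor],
                      allow_managed: bool = False) -> None:
    """Raise if tensors span devices, unless the relaxed rule permits it."""
    tensors = list(tensors)
    if len({t.device_index for t in tensors}) <= 1:
        return  # all on one device (or no tensors): always fine
    if allow_managed and all(t.managed for t in tensors):
        return  # relaxed rule: every operand lives in managed memory
    raise RuntimeError("tensors are on different CUDA devices")

# Strict behaviour: operands on two devices are rejected.
try:
    check_same_device([ToyTensor(0), ToyTensor(1)])
except RuntimeError as err:
    print("strict:", err)

# Relaxed behaviour: managed operands on two devices are accepted.
check_same_device([ToyTensor(0, managed=True), ToyTensor(1, managed=True)],
                  allow_managed=True)
print("relaxed: accepted")
```

The sketch only relaxes the rule when every operand is managed, on the assumption that ordinary cudaMalloc allocations still need the strict check.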