I want to use CPU RAM as swap space for GPU RAM to allow oversubscription. Is there a way to implement this in CUDA or C++?
As a first step, I replaced every cudaMalloc with cudaMallocManaged and tested it with a simple GEMM on a Pascal GPU. I increase the matrix size and check when I get a memory error. Right now both versions crash at the same size, so managed memory doesn't seem to buy us any extra capacity. Do you have any thoughts on this?
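For anyone repeating this test, it helps to know roughly where the crash should happen without oversubscription. Here is a rough sketch of that arithmetic, assuming square float matrices and a plain C = A * B with three buffers. The helper names are made up for illustration, and any workspace the GEMM library allocates on top is ignored.

```cpp
#include <cmath>
#include <cstdint>

// Bytes needed for C = A * B with square N x N matrices (three buffers).
// Assumption: no extra workspace is counted, only A, B and C.
std::uint64_t gemm_bytes(std::uint64_t n,
                         std::uint64_t elem_size = sizeof(float)) {
    return 3 * n * n * elem_size;
}

// Largest N such that the three N x N matrices fit in `budget_bytes`.
std::uint64_t max_square_n(std::uint64_t budget_bytes,
                           std::uint64_t elem_size = sizeof(float)) {
    // Elements available per matrix once the budget is split three ways.
    const std::uint64_t per_matrix = budget_bytes / (3 * elem_size);
    auto n = static_cast<std::uint64_t>(
        std::sqrt(static_cast<long double>(per_matrix)));
    // Correct for any floating-point rounding in the square root.
    while (n * n > per_matrix) --n;
    while ((n + 1) * (n + 1) <= per_matrix) ++n;
    return n;
}
```

As an example, for a hypothetical 16 GiB card, max_square_n(16ULL << 30) gives 37837. If the cudaMallocManaged build fails at about the same N as the cudaMalloc build, that suggests the managed allocations are still being limited to physical GPU memory.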
Large Model Support implements exactly what you describe:
This is the contribution proposal for the LMS feature that is currently available in the Watson Machine Learning solution. For information about that solution, see https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.1/navigation/wmlce_getstarted_pytorch.html
You might be interested in the following blog post, which walks through a use case for PyTorch LMS and some performance considerations:
@kouyoumin thank you so much for your post - I am trying to replicate with the most recent version of PyTorch and it looks like the spots where you edited the code are different now - any pointers as to where these changes could be made? (really excited to try this on our new NVSwitch’d A100s to see if the unified memory can truly be “one giant GPU” as NVIDIA claims with a contiguous address space…)
@mohotmoz My idea was simple: replace cudaMalloc with cudaMallocManaged. However, there is still a check that prevents a CUDA op from accessing tensors on another CUDA device, so I modified the logic to allow that case.
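To illustrate the allocation swap at the raw CUDA runtime level (this is not the actual PyTorch patch, just a minimal sketch), something like the following could be used. The helper names alloc_buffer, supports_demand_paging, request_size and probe_oversubscription are hypothetical. The `__has_include` guard (C++17) is only there so the file still compiles on a machine without the CUDA toolkit. As far as I know, oversubscription relies on the device reporting concurrent managed access, which is what the attribute query checks.

```cpp
#include <cstddef>
#include <cstdio>
#include <initializer_list>

// Hypothetical helper: size of a request that is `factor` times the free
// device memory (for example 1.5 means 50% more than currently fits).
std::size_t request_size(std::size_t free_bytes, double factor) {
    return static_cast<std::size_t>(static_cast<double>(free_bytes) * factor);
}

#if __has_include(<cuda_runtime.h>)
#include <cuda_runtime.h>

// Hypothetical drop-in for cudaMalloc: managed (unified) memory when
// `managed` is true, ordinary device memory otherwise.
cudaError_t alloc_buffer(void** ptr, std::size_t bytes, bool managed) {
    return managed ? cudaMallocManaged(ptr, bytes, cudaMemAttachGlobal)
                   : cudaMalloc(ptr, bytes);
}

// True if the device reports concurrent managed access, which is assumed
// here to be the precondition for page-fault based oversubscription.
bool supports_demand_paging(int device) {
    int value = 0;
    cudaError_t err = cudaDeviceGetAttribute(
        &value, cudaDevAttrConcurrentManagedAccess, device);
    return err == cudaSuccess && value != 0;
}

// Requests more than the free device memory with both allocators and
// prints the result. A successful allocation alone does not prove that a
// kernel can touch every page; it only shows the request was accepted.
void probe_oversubscription(double factor) {
    std::size_t free_b = 0, total_b = 0;
    if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess) {
        std::puts("cudaMemGetInfo failed");
        return;
    }
    std::printf("demand paging reported: %s\n",
                supports_demand_paging(0) ? "yes" : "no");
    const std::size_t bytes = request_size(free_b, factor);
    for (bool managed : {false, true}) {
        void* p = nullptr;
        cudaError_t err = alloc_buffer(&p, bytes, managed);
        std::printf("%s of %zu bytes: %s\n",
                    managed ? "cudaMallocManaged" : "cudaMalloc",
                    bytes, cudaGetErrorString(err));
        if (err == cudaSuccess) {
            cudaFree(p);
        } else {
            cudaGetLastError();  // reset the error state before the next try
        }
    }
}
#endif
```

Calling probe_oversubscription(1.5) from main on the target machine should show whether the managed request is accepted where the plain cudaMalloc one is refused. If both fail, the platform probably does not support oversubscription, and swapping the allocator alone won't help.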
I have updated the code to match recent changes in PyTorch.