Allocate all memory and reuse cached memory

Hi, our office has a server and several people share its GPUs.

However, I want to occupy a single card entirely so that other users' jobs cannot affect my program.

My approach is to allocate all available memory at the beginning and let PyTorch reuse this cached memory later, as follows:

import os
import torch

def check_mem():
    # Query the GPU's total and used memory (in MiB) via nvidia-smi.
    mem = os.popen(r'"<path\to\NVSMI>\nvidia-smi" --query-gpu=memory.total,memory.used --format=csv,nounits,noheader').read().split(",")
    return mem

def main():
    total, used = check_mem()
    total = int(total)
    used = int(used)

    # Target 80% of the card, minus what is already in use.
    max_mem = int(total * 0.8)
    block_mem = max_mem - used

    # A (256, 1024) float32 slice is exactly 1 MiB, so this tensor
    # occupies block_mem MiB on the GPU.
    x = torch.rand((256, 1024, block_mem)).cuda()
    del x

    # do things here

if __name__ == '__main__':
    main()

However, the above approach reliably leads to an out-of-memory error on my machine:

RuntimeError: CUDA out of memory. Tried to allocate 1024.00 KiB (GPU 0; 3.95 GiB total capacity; 395.42 MiB already allocated; 15.38 MiB free; 2.36 GiB cached)

It seems that PyTorch cannot reuse this cached memory.
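For reference, the gap between what live tensors hold and what the caching allocator keeps can be inspected directly; a minimal sketch (torch.cuda.memory_cached() is, as far as I know, renamed to memory_reserved() in newer PyTorch releases):

import torch

def report_gpu_mem(device=0):
    # Bytes currently occupied by live tensors.
    allocated = torch.cuda.memory_allocated(device)
    # Bytes held by the caching allocator, including freed blocks
    # that have not been returned to the driver.
    cached = torch.cuda.memory_cached(device)
    print('allocated: %.1f MiB, cached: %.1f MiB'
          % (allocated / 1024**2, cached / 1024**2))

Calling this before and after del x should show the cached figure staying high while the allocated figure drops.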

Can anyone give me some tips to solve this problem?
Thank you.

Please avoid posting the same message multiple times…
I answered in the other post that asked the same question here.

The short answer is that PyTorch is not built to help with sharing GPUs, but to use them as efficiently as possible. What you use here is a hack, and it will have side effects, such as making your program OOM in cases where it otherwise would not.
If you want more details on why we can’t fix the memory fragmentation problem in pytorch, you can check this thread.
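If you still want to reserve the card this way, one variation of the same hack (with all the caveats above still applying) is to hold the memory in many smaller tensors and release only as many as the real workload needs, rather than one huge block. A rough sketch; the 256 MiB chunk size and the helper names are arbitrary choices, not part of any PyTorch API:

import torch

CHUNK_MIB = 256  # arbitrary chunk size in MiB

def reserve_chunks(free_mib, device='cuda'):
    # Each (256, 1024, CHUNK_MIB) float32 tensor occupies CHUNK_MIB MiB.
    return [torch.empty((256, 1024, CHUNK_MIB), device=device)
            for _ in range(free_mib // CHUNK_MIB)]

def release(chunks, n):
    # Drop n chunks so roughly n * CHUNK_MIB MiB becomes available
    # again through PyTorch's caching allocator.
    del chunks[:n]

This does not remove the fragmentation problem; it only lets you give the reserved memory back piece by piece instead of all at once.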