Reserving GPU memory?

Fisher · September 16, 2018, 5:37pm

Hi PyTorch Forum,

I have access to a server with an NVIDIA K80. The problem is that about 5 other people use this server alongside me. Most of them use TensorFlow with default settings, which means their processes allocate the full GPU memory at startup.
I use PyTorch, which dynamically allocates the memory it needs for its calculations.
Here is the problem scenario:
1.) I start my process, which will run for about 7 days.
2.) Two days later, somebody starts a TensorFlow process.
3.) If my process then needs more memory for some calculation, it raises a CUDA out-of-memory exception, because the other process has allocated all the remaining free memory…

So I was thinking: is there a way to reserve, let's say, 3 GB of GPU memory for my process, which PyTorch can then use dynamically, while the other users see my process as permanently occupying that space?

Edit: My best idea so far is a little trick:
At startup you initialize 6 variables, each containing 0.5 GB of random data. Each time you get an out-of-memory exception, you delete one of these variables and retry at the point where the exception was raised. That way you have something like a 3 GB buffer.
But this kind of solution is really ugly in my opinion…
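
A minimal sketch of that buffer trick, for illustration only. The names `run_with_ballast`, `make_ballast` and `is_cuda_oom` are made up for this example, and it assumes that a CUDA out-of-memory condition surfaces as a `RuntimeError` whose message contains "out of memory":

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def is_cuda_oom(err: RuntimeError) -> bool:
    # Assumption: the OOM error text contains "out of memory",
    # as in "RuntimeError: CUDA out of memory. ...".
    return "out of memory" in str(err)


def run_with_ballast(step: Callable[[], T], ballast: List[object]) -> T:
    """Run step(); on an OOM error, drop one ballast chunk and retry."""
    while True:
        try:
            return step()
        except RuntimeError as err:
            if not is_cuda_oom(err) or not ballast:
                raise  # not an OOM error, or no ballast left to release
            ballast.pop()  # drop the last reference to one chunk, then retry


def make_ballast(n_chunks: int = 6, chunk_mib: int = 512) -> List[object]:
    """Allocate n_chunks GPU tensors of chunk_mib MiB each (hypothetical helper)."""
    import torch  # imported here so the retry logic can be tried without a GPU

    # 256 * 1024 float32 values occupy exactly 1 MiB.
    return [
        torch.empty((chunk_mib, 256, 1024), dtype=torch.float32, device="cuda")
        for _ in range(n_chunks)
    ]
```

Usage would look like `ballast = make_ballast()` once at startup, then `run_with_ballast(lambda: train_one_batch(batch), ballast)` around each step, where `train_one_batch` stands in for whatever your own code does.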

Fisher · September 17, 2018, 6:04am

Ok, I found a solution that works for me:

On startup I query the total and used memory on the GPU, work out how much more I need to allocate to bring the overall usage to 80% of the total, create a tensor that big and put it on the GPU. Directly after doing that, I overwrite it with a small value.
While the process is running, 80% of the GPU memory stays blocked, and PyTorch uses this space.

Code:

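import os
import torch

def check_mem():
    # Query total and used GPU memory (in MiB) via nvidia-smi.
    # Raw string, so the backslashes in the Windows path are not read as escapes.
    # Assumes nvidia-smi reports a single GPU (one output line).
    mem = os.popen(r'"<path\to\NVSMI>\nvidia-smi" --query-gpu=memory.total,memory.used --format=csv,nounits,noheader').read().split(",")
    return mem

def main():
    total, used = check_mem()

    total = int(total)
    used = int(used)

    # Claim memory until overall usage reaches 80% of the card.
    max_mem = int(total * 0.8)
    block_mem = max_mem - used

    # 256 * 1024 float32 values take 1 MiB, so this tensor is block_mem MiB.
    x = torch.rand((256,1024,block_mem)).cuda()
    # Overwrite it with a tiny tensor; the freed memory stays in PyTorch's cache.
    x = torch.rand((2,2)).cuda()

    # do things here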

albanD · September 17, 2018, 9:19am

Hi,

First of all, this is more of a cluster sharing problem from my point of view than a real need.

Anyway, your solution of allocating a tensor and then deleting it will work, because the caching allocator keeps the memory around for subsequent allocations. You don't need to replace it; you can just do del x right after creating it.
Be aware that this can have the side effect of increasing the overall memory usage of your program, and that as soon as your program comes close to running out of memory, the allocator will free all unused memory and your "memory pool" will be gone.
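
A sketch of the allocate-then-`del` variant, under some assumptions: it uses `torch.cuda.mem_get_info`, which is available in recent PyTorch releases (it did not exist when this thread was written), instead of calling nvidia-smi, and the helper names `mib_to_reserve` and `reserve_gpu_memory` are invented for this example:

```python
def mib_to_reserve(total_mib: int, used_mib: int, fraction: float = 0.8) -> int:
    """MiB to claim so that overall usage reaches `fraction` of the card."""
    return max(0, int(total_mib * fraction) - used_mib)


def reserve_gpu_memory(fraction: float = 0.8, device: int = 0) -> int:
    """Allocate one large tensor and delete it right away; returns MiB claimed."""
    import torch  # imported here so the arithmetic above works without a GPU

    # mem_get_info returns (free, total) in bytes for the given device.
    free_b, total_b = torch.cuda.mem_get_info(device)
    total_mib = total_b // 2**20
    used_mib = (total_b - free_b) // 2**20
    n_mib = mib_to_reserve(total_mib, used_mib, fraction)
    if n_mib > 0:
        # 256 * 1024 float32 values occupy 1 MiB, so this is n_mib MiB.
        x = torch.empty((n_mib, 256, 1024), dtype=torch.float32,
                        device=f"cuda:{device}")
        del x  # the block goes back to PyTorch's cache, not to the driver
    return n_mib
```

After calling `reserve_gpu_memory()`, comparing `torch.cuda.memory_allocated()` with `torch.cuda.memory_reserved()` should show little memory allocated but a large amount still reserved by the caching allocator. The caveats above still apply: the reservation is not guaranteed to survive once the allocator starts freeing unused cached memory.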

Fisher · September 17, 2018, 5:19pm

Thank you for this hint!

I agree with you that this isn't PyTorch's concern. Sadly, our server support is really slow, so I needed some workaround.

dragen · January 22, 2019, 4:23am

@albanD Thank you so much.
I adapted the above code os.popen('"<path\to\NVSMI>\nvidia-smi" --query-gpu=memory.total,memory.used --format=csv,nounits,noheader').read().split(",") to query a single specific GPU when several cards are available.

    import os

    deviceid = 1
    # Make only this GPU visible to CUDA; nvidia-smi below still lists every card.
    os.environ['CUDA_VISIBLE_DEVICES'] = str(deviceid)

    # ================================================
    # nvidia-smi prints one "total, used" line per GPU; pick the line for deviceid.
    total, used = os.popen(
        '"nvidia-smi" --query-gpu=memory.total,memory.used --format=csv,nounits,noheader'
            ).read().split('\n')[deviceid].split(',')
    total = int(total)
    used = int(used)

    print(deviceid, 'Total GPU mem:', total, 'used:', used)
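
The same per-GPU query can be sketched with the parsing split out into its own function, which makes it easier to check. This assumes `nvidia-smi` is on the `PATH` and prints one `total, used` line per GPU in MiB; `parse_smi_csv` and `query_gpu_memory` are names invented for this example:

```python
import subprocess
from typing import List, Tuple


def parse_smi_csv(text: str) -> List[Tuple[int, int]]:
    """Parse 'total, used' lines (one per GPU) into integer pairs."""
    rows = []
    for line in text.strip().splitlines():
        # int() tolerates the space that follows the comma.
        total, used = (int(field) for field in line.split(","))
        rows.append((total, used))
    return rows


def query_gpu_memory() -> List[Tuple[int, int]]:
    """Run nvidia-smi and return (total, used) for every GPU it lists."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total,memory.used",
         "--format=csv,nounits,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_smi_csv(out)
```

With this, `total, used = query_gpu_memory()[deviceid]` would replace the chained `split` calls. One thing to keep in mind: nvidia-smi lists cards in PCI bus order, which can differ from CUDA's default device order unless `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set.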

dragen · January 24, 2019, 4:40am

@albanD @Fisher
I found that the above code leads to an out-of-memory error:

RuntimeError: CUDA out of memory. Tried to allocate 1024.00 KiB (GPU 0; 3.95 GiB total capacity; 231.38 MiB already allocated; 6.25 MiB free; 2.52 GiB cached)

Why can't PyTorch reuse the cached memory?

albanD · January 24, 2019, 9:46am

As I mentioned above, this is a hack to try and pretend the memory is in use. You should not need to do it, and it can have side effects, because this is not what the allocator is made for!
In particular, if any allocation fails due to fragmentation, we dealloc and realloc memory to reduce fragmentation. But since the whole memory was allocated at once, this is not possible after this hack, so it will OOM even though it would work without it.
Here again, @dragen, why do you need to do this?