Hi PyTorch Forum,
I have access to a server with an NVIDIA K80. The problem is that about five other people use this server alongside me. Most of them use TensorFlow with default settings, which means their processes allocate the full GPU memory at startup.
I use PyTorch, which allocates memory dynamically as its calculations need it.
Here is the problem scenario:
1.) I start my process, which will run for about 7 days.
2.) Two days later, somebody decides to start their TensorFlow process.
3.) If my process then needs more memory for some calculation, it raises a CUDA out-of-memory exception, because the other process has already allocated all of the remaining free memory…
So I was thinking: is there a way to reserve, let's say, 3 GB of GPU memory for my process, which PyTorch can then use dynamically, while the other users see my process occupying that space permanently?
Edit: My best idea so far is a little trick:
On startup you initialize 6 variables, each holding 0.5 GB of random data. Every time you hit an out-of-memory exception, you delete one of these variables and retry the operation that raised it. That way you effectively have a 3 GB buffer (see the sketch below).
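For what it's worth, a minimal sketch of that trick (assuming six 0.5 GB float32 chunks and a hypothetical `run_with_buffer` wrapper; the retry granularity, whole function vs. the single failing step, is up to you):

```python
import torch

# Reserve ~3 GB up front as six 0.5 GB dummy tensors.
# 0.5 GB of float32 is (512 MiB / 4 bytes) elements; sizes are illustrative.
CHUNK_ELEMS = (512 * 1024**2) // 4
reserve = [torch.empty(CHUNK_ELEMS, dtype=torch.float32, device='cuda')
           for _ in range(6)]

def run_with_buffer(fn, *args, **kwargs):
    """Retry `fn`, releasing one reserved chunk per CUDA out-of-memory error."""
    while True:
        try:
            return fn(*args, **kwargs)
        except RuntimeError as e:
            if 'out of memory' not in str(e) or not reserve:
                raise
            # Drop 0.5 GB of the buffer. Deliberately do NOT call
            # torch.cuda.empty_cache(): the freed block stays in PyTorch's
            # caching allocator, so the retry can reuse it and other
            # processes can't grab it.
            del reserve[-1]
```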
But this kind of solution is really ugly in my opinion…