Effects of CUDA out of memory on shared GPU

I was given access to a remote workstation where I can use a GPU to train my model. Of course, all the resources are shared, and the GPU memory is often partially used by other people's processes.
Does getting a CUDA out of memory error mean that I just made someone else's process crash?

It's possible, but not guaranteed.
Basically, whichever process asks for memory while the GPU is full will hit the OOM error and crash.
So it can happen to only one user, or to all of them if they all happen to request memory at that time.
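To make this concrete, here is a minimal PyTorch sketch (the function name `try_allocate` is just for illustration) showing that an allocation on a full GPU raises a `RuntimeError` mentioning "out of memory" in the *requesting* process; if you catch it, your process survives and other users' jobs are untouched:

```python
import torch

def try_allocate(n_bytes, device="cuda"):
    """Attempt to reserve roughly n_bytes on the device.

    Returns the tensor on success, or None if the device ran out of
    memory. Any other RuntimeError is re-raised.
    """
    try:
        # One float32 element occupies 4 bytes.
        return torch.empty(n_bytes // 4, dtype=torch.float32, device=device)
    except RuntimeError as err:
        if "out of memory" in str(err):
            # Our request failed; processes already holding memory keep running.
            torch.cuda.empty_cache()
            return None
        raise
```

If the error is *not* caught, the exception propagates and kills the training script, which is what the answer above describes.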


Thanks for the answer! Anyway, that's too bad… I don't want to kill someone else's model that may have been training for a long time, and I don't want my model killed for the same reason! Have you ever worked in such a scenario? Do you know of any tricks to prevent it?

I don't think there is a good solution. Even if you limited your code to e.g. 25% of the device memory, other processes might still try to use the whole GPU and thus cause an OOM error. :confused:
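For reference, the per-process cap mentioned above can be set with `torch.cuda.set_per_process_memory_fraction`, and `torch.cuda.mem_get_info` reports how much device memory is currently free. A hedged sketch (the helper name `cap_memory` is hypothetical), with the caveat from the answer: this limits only *your* process, and cannot stop other users from filling the GPU:

```python
import torch

def cap_memory(fraction=0.25, device=0):
    """Cap this process's CUDA caching allocator at a fraction of device memory.

    Returns (free_bytes, total_bytes) for the device, or None when no
    CUDA device is available. Note: this bounds only the current
    process; other processes on the shared GPU are unaffected.
    """
    if not torch.cuda.is_available():
        return None
    torch.cuda.set_per_process_memory_fraction(fraction, device)
    return torch.cuda.mem_get_info(device)
```

Checking `mem_get_info` before launching a large job at least tells you how much headroom the other users have left you at that moment, though it can change right after you look.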