I’m loading a very large model onto the GPU and running multiple workers. To save memory, I want the workers to share a single copy of the model in GPU memory, since it will never be modified.
With TorchServe or Gunicorn on CPU, I can fork() the workers, and on Linux the memory is copy-on-write by default. As long as I never write to it, the memory stays shared among the workers.
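To make the CPU case concrete, here is a minimal sketch of what I mean, using only the standard library. `WEIGHTS`, `worker` and `run_workers` are hypothetical names, the byte string is just a stand-in for real model weights, and the fork start method assumes Linux (or another Unix).

```python
import multiprocessing as mp

# Stand-in for read-only model weights, allocated once in the parent (1 MiB).
WEIGHTS = bytes(range(256)) * 4096


def worker(rank, results):
    # The forked child sees WEIGHTS through copy-on-write pages inherited
    # from the parent; the buffer is never pickled or sent to the child.
    results.put((rank, sum(WEIGHTS[:256])))


def run_workers(num_workers=2):
    ctx = mp.get_context("fork")  # fork is not available on Windows
    results = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(rank, results))
             for rank in range(num_workers)]
    for p in procs:
        p.start()
    # Drain the queue before joining so the children can exit cleanly.
    out = sorted(results.get() for _ in procs)
    for p in procs:
        p.join()
    return out


if __name__ == "__main__":
    print(run_workers())
```

One caveat I’m aware of: CPython updates reference counts inside object headers, so pages holding many small Python objects can still be copied even if I never modify them. Large contiguous buffers such as tensor storage are much less affected.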
What happens when the model lives in GPU memory instead? The PyTorch docs have some tips on multiprocessing and sharing CUDA memory (Multiprocessing best practices — PyTorch 1.12 documentation), which say to use the spawn start method instead of fork. Is this implemented in TorchServe, and if so, how? (It obviously won’t work with Gunicorn any more.)
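For reference, this is roughly the pattern I understand the PyTorch docs to describe, as a sketch in plain PyTorch. It says nothing about what TorchServe actually does. `worker` and `share_with_workers` are hypothetical names, the tensor of ones is a stand-in for real weights, and the CPU fallback is only there so the sketch also runs on a machine without a GPU.

```python
START_METHOD = "spawn"  # the CUDA runtime does not support the fork start method


def worker(rank, weights, results):
    # In the child, `weights` refers to the parent's allocation rather than
    # a private copy; it is read-only by convention only, nothing enforces it.
    results.put((rank, weights.device.type, float(weights.sum())))


def share_with_workers(num_workers=2):
    # Imported here so the module itself loads on machines without PyTorch.
    import torch
    import torch.multiprocessing as mp

    ctx = mp.get_context(START_METHOD)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Stand-in for model weights, allocated once in the parent.
    weights = torch.ones(1024, device=device)

    results = ctx.Queue()
    # Passing the tensor as a process argument sends a handle to its storage
    # (CUDA IPC on GPU, shared memory on CPU), not a copy of the data.
    procs = [ctx.Process(target=worker, args=(rank, weights, results))
             for rank in range(num_workers)]
    for p in procs:
        p.start()
    out = sorted(results.get() for _ in procs)
    for p in procs:
        p.join()
    # `weights` is still referenced here: the parent has to keep the tensor
    # alive for as long as the children are using it.
    return out
```

Because of spawn, `share_with_workers()` has to be called from under an `if __name__ == "__main__":` guard in a real script. As far as I can tell, each worker would still create its own CUDA context, so only the tensor data itself would be shared.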