TorchServe GPU memory sharing when loading big models

I’m loading a very large model onto the GPU and running multiple workers. To save memory, I’d like the workers to share the GPU memory, since the model will be immutable.

With TorchServe or Gunicorn on CPU, I can fork() the workers, and on Linux the memory is copy-on-write by default; as long as the model stays immutable, it is shared among the workers.
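For illustration, here’s a minimal sketch of that CPU path (a tiny nn.Sequential stands in for the real model, and worker counts are arbitrary): the parent loads the model once, fork()s a few children, and as long as the children only run inference, the pages stay shared via copy-on-write.

```python
import os
import torch

# Stand-in for the real large model; loaded once in the parent process.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 8),
)
model.eval()

def worker(rank: int) -> None:
    # Inference only: as long as nothing writes to the parameters,
    # the copy-on-write pages stay shared with the parent.
    with torch.no_grad():
        out = model(torch.randn(1, 128))
    print(f"worker {rank}: output shape {tuple(out.shape)}")

children = []
for rank in range(4):
    pid = os.fork()
    if pid == 0:      # child process
        worker(rank)
        os._exit(0)   # skip parent cleanup in the child
    children.append(pid)

for pid in children:
    os.waitpid(pid, 0)
```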

What happens when I’m using GPU memory? The PyTorch notes on multiprocessing and sharing CUDA memory (Multiprocessing best practices — PyTorch 1.12 documentation) talk about using spawn() instead of fork(), but I wonder whether, and how, this is implemented in TorchServe (it obviously won’t work with Gunicorn any more)?
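To frame the question, this is roughly the pattern those docs describe (a self-contained sketch, not TorchServe code): with the spawn start method, torch.multiprocessing ships a CUDA tensor to child processes as a CUDA IPC handle rather than a copy, so all workers map the same GPU allocation.

```python
import torch
import torch.multiprocessing as mp

def worker(rank: int, weights: torch.Tensor) -> None:
    # 'weights' maps the same GPU allocation as in the parent process;
    # treat it as read-only so no synchronization is needed.
    print(f"worker {rank}: device={weights.device}, sum={weights.sum().item():.4f}")

if __name__ == "__main__":
    mp.set_start_method("spawn")                      # required for sharing CUDA tensors
    weights = torch.randn(1024, 1024, device="cuda")  # stand-in for model weights
    procs = [mp.Process(target=worker, args=(rank, weights)) for rank in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # the parent must outlive the consumers of the IPC handle
```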

Thanks!

Unfortunately, TorchServe does not yet support sharing GPU memory. It is, however, a top priority for the team, so please follow the release notes on GitHub - pytorch/serve: Serve, optimize and scale PyTorch models in production for updates when we release this.

Is there any way I can contribute? I’m a machine learning engineer with quite a bit of backend development experience, and I have both a personal interest in this and some spare time.

I did some quick research, and this seems quite doable if we load the model as immutable (to avoid the trouble of locks) and share it via CUDA IPC.
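To make the idea concrete, here’s a rough sketch of what I have in mind (a toy nn.Sequential stands in for the real model, and I’m assuming torch.multiprocessing.spawn is an acceptable worker launcher): the parent loads the model onto the GPU once, and the spawned workers receive its parameter storages as CUDA IPC handles, so they all serve from the same allocation.

```python
import torch
import torch.multiprocessing as mp

def serve(rank: int, model: torch.nn.Module) -> None:
    # The parameters reference the parent's GPU memory via CUDA IPC.
    # Inference only, so no locks are needed.
    with torch.inference_mode():
        x = torch.randn(1, 512, device="cuda")
        y = model(x)
    print(f"worker {rank}: output norm {y.norm().item():.4f}")

if __name__ == "__main__":
    # Toy stand-in for the real large model.
    model = torch.nn.Sequential(
        torch.nn.Linear(512, 512),
        torch.nn.ReLU(),
        torch.nn.Linear(512, 512),
    ).cuda().eval()
    for p in model.parameters():
        p.requires_grad_(False)  # keep the weights immutable by convention

    # spawn() each worker; the model's CUDA storages are shared, not copied.
    mp.spawn(serve, args=(model,), nprocs=2, join=True)
```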

It would also be interesting to make Torch GPU multiprocessing work with Gunicorn, as it’s very popular for ML serving in Python.
(This may require more work, since Gunicorn uses fork() internally. On Linux that shares memory with the workers as long as it stays immutable, so it could be used to share the IPC handle. Win32 is a different story, though.)

Sure, I’m always happy to see community contributions. Feel free to open a GitHub issue on pytorch/serve and we can discuss some concrete next steps.