Best way to deploy multiple models on one GPU

I want to deploy 20+ models on a single GPU and run them in parallel. If I deploy the models in 20+ processes (that is, one model per process), the GPU runs out of memory, because each process's initialization step consumes a lot of memory. If I deploy all the models in one process, they cannot run in parallel. What's the best way to do this?

Using a single process won't limit the parallelization of the models: you can still launch the different models on different CUDA streams. Depending on the available compute resources the kernels might still be serialized on the device, but the same would be the case with multiple processes.
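
A minimal sketch of the streams approach, assuming PyTorch; the `torch.nn.Linear` modules, tensor shapes, and model count here are placeholders for your actual models:

```python
import torch

device = torch.device("cuda")

# Placeholder models standing in for the 20+ real ones
# (assumption: any nn.Module works the same way here).
models = [torch.nn.Linear(1024, 1024).to(device) for _ in range(4)]
streams = [torch.cuda.Stream() for _ in models]

x = torch.randn(64, 1024, device=device)

# Make each side stream wait for the default stream so the input
# tensor is fully written before the models read it.
for stream in streams:
    stream.wait_stream(torch.cuda.current_stream())

# Launch each model's forward pass on its own stream; kernels from
# different streams can overlap if the GPU has spare compute resources.
outputs = []
with torch.no_grad():
    for model, stream in zip(models, streams):
        with torch.cuda.stream(stream):
            outputs.append(model(x))

# Block until all streams have finished before using the outputs.
torch.cuda.synchronize()
```

Note that whether you actually see overlap depends on the workload: if a single model's kernels already saturate the GPU, the streams will effectively execute one after another, which is the serialization mentioned above.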