Best way to deploy multiple models on one GPU

I want to deploy 20+ models on a single GPU and run them in parallel. If I deploy the models in 20+ processes (that is, one model per process), the GPU runs out of memory, because each process's initialization step consumes a lot of memory. If I deploy all the models in one process, they cannot run in parallel. What's the best way to do this?

Using a single process won't limit the parallelization of the models: you can still launch the different models on different CUDA streams. Depending on the available compute resources the kernels might still be serialized on the device, but the same would be the case with multiple processes.
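
A minimal sketch of the streams approach, assuming PyTorch; the `torch.nn.Linear` modules, tensor shapes, and model count here are placeholders for your actual models:

```python
import torch

device = torch.device("cuda")

# Placeholder models standing in for the 20+ real ones
# (assumption: any nn.Module works the same way here).
models = [torch.nn.Linear(1024, 1024).to(device) for _ in range(4)]
streams = [torch.cuda.Stream() for _ in models]

x = torch.randn(64, 1024, device=device)

# Make each side stream wait for the default stream so the input
# tensor is fully written before the models read it.
for stream in streams:
    stream.wait_stream(torch.cuda.current_stream())

# Launch each model's forward pass on its own stream; kernels from
# different streams can overlap if the GPU has spare compute resources.
outputs = []
with torch.no_grad():
    for model, stream in zip(models, streams):
        with torch.cuda.stream(stream):
            outputs.append(model(x))

# Block until all streams have finished before using the outputs.
torch.cuda.synchronize()
```

Note that whether you actually see overlap depends on the workload: if a single model's kernels already saturate the GPU, the streams will effectively execute one after another, which is the serialization mentioned above.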