Deploying multiple PyTorch models on a single GPU

Hello PyTorch community,

Suppose I have 10 different PyTorch models (classification, detection, embedding) and 10 GPUs, and I would like to serve real-time image traffic on these models. We can assume a uniform traffic distribution across the models. What is the most efficient (low latency, high throughput) way to deploy them?

  1. Deploy all 10 models on every GPU (10 models on all 10 GPUs). This will probably incur context-switch and cache-miss costs, and memory management might also be costlier, so latency will likely be higher. (?)
  2. Deploy a single model on each GPU (10 models on 10 GPUs). No context-switch or cache-miss costs. Should work well for uniform traffic. (?)

Any suggestions, insights, experience?

Thanks!

I can tell that someone was watching Tesla AI Day. :slight_smile:

You can deploy as many models as your GPU has memory to hold. However, during training a model can require 4-5 times more GPU RAM (and a corresponding increase in compute time) because of the optimization step, so you might be better off allocating GPUs for training via a distributed training workflow. Just make sure all of your models together fit on the deployment GPU. One way to check this is to load the 10 models on that GPU untrained and in eval mode. Granted, the inference won't be any good since you haven't trained them yet, but it will show whether you need to tune the model sizes before training. Also leave a memory buffer of 10-15% below the GPU's maximum memory.
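For example, here is a minimal sketch of that memory check, using a few torchvision classification models as stand-ins for your own 10 architectures (swap in your actual model constructors):

```python
import torch
import torchvision.models as models

# Stand-ins for your 10 untrained models; replace with your own architectures.
model_ctors = {
    "classifier_a": models.resnet50,
    "classifier_b": models.resnet18,
    "embedder": models.efficientnet_b0,
}

device = torch.device("cuda:0")
torch.cuda.reset_peak_memory_stats(device)

loaded = {}
for name, ctor in model_ctors.items():
    # weights=None -> random init; we only care about the memory footprint here
    # (on older torchvision, use pretrained=False instead)
    model = ctor(weights=None).eval().to(device)
    loaded[name] = model
    print(f"{name}: {torch.cuda.memory_allocated(device) / 1e9:.2f} GB allocated so far")

# A dummy forward pass to also account for activation memory at inference time
with torch.inference_mode():
    dummy = torch.randn(1, 3, 224, 224, device=device)
    for model in loaded.values():
        model(dummy)

peak = torch.cuda.max_memory_allocated(device) / 1e9
total = torch.cuda.get_device_properties(device).total_memory / 1e9
print(f"peak: {peak:.2f} GB of {total:.2f} GB -- keep a 10-15% buffer")
```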

The straightforward answer is one model per GPU, but the main issue there is that some GPUs will be under-utilized and others over-utilized, since traffic will almost certainly not be uniform, which means you waste money.

Generally in this scenario you want a framework to figure out how to allocate multiple models to multiple GPUs, and that's one of the reasons we built TorchServe, a multi-model inference framework that takes care of this for you: https://github.com/pytorch/serve

You would simply need to set a number of workers for each model, depending on its importance to you or its speed, after doing some profiling.
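For reference, a minimal sketch of that per-model worker setup via TorchServe's management API (default port 8081). The .mar archive names and worker counts here are hypothetical, assuming the models have already been packaged with torch-model-archiver and placed in the model store, with model names matching the archive names:

```python
import requests

MGMT = "http://localhost:8081"  # TorchServe management API (default port)

# Hypothetical archives already in the model store; give heavier-traffic
# or slower models more workers based on your profiling.
models = {
    "classifier.mar": 4,
    "detector.mar": 2,
    "embedder.mar": 2,
}

for mar, workers in models.items():
    # Register each model and spin up its initial workers
    resp = requests.post(
        f"{MGMT}/models",
        params={"url": mar, "initial_workers": workers, "synchronous": "true"},
    )
    resp.raise_for_status()
    print(mar, resp.json())

# Later, rescale a model's workers if its observed traffic changes
requests.put(f"{MGMT}/models/classifier", params={"min_worker": 6, "synchronous": "true"})
```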