Hello PyTorch community,
Suppose I have 10 different PyTorch models (classification, detection, embedding) and 10 GPUs. I would like to serve real-time image traffic with these models, and we can assume traffic is distributed uniformly across them. What is the most efficient way to deploy them, in terms of low latency and high throughput? I see two options:
- Deploy all 10 models on every GPU, so each GPU hosts a copy of every model. This will probably incur context-switching and cache-miss costs, and memory management might also be costlier, so latency would likely be high. (?)
- Deploy a single model on each GPU (one model per GPU). This avoids context-switching and cache-miss costs, and could work well for uniform traffic. (?)
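To make the two options concrete, here is a rough sketch of the placements I have in mind, written as plain routing tables. The model names and helper functions are hypothetical, made up only for illustration. In a real deployment each entry would correspond to something like `model.to(f"cuda:{gpu}")`, and nothing here measures latency or throughput.

```python
NUM_GPUS = 10
# Hypothetical model names, standing in for the 10 real models.
MODEL_NAMES = [f"model_{i}" for i in range(10)]


def replicated_placement(models, num_gpus):
    """Option 1: every GPU hosts a copy of every model."""
    return {gpu: list(models) for gpu in range(num_gpus)}


def dedicated_placement(models, num_gpus):
    """Option 2: models are assigned to GPUs round-robin.

    With 10 models and 10 GPUs this gives exactly one model per GPU.
    """
    placement = {gpu: [] for gpu in range(num_gpus)}
    for i, model in enumerate(models):
        placement[i % num_gpus].append(model)
    return placement


def gpus_for(model, placement):
    """Return the GPUs that can serve a request for `model`."""
    return [gpu for gpu, hosted in placement.items() if model in hosted]


replicated = replicated_placement(MODEL_NAMES, NUM_GPUS)
dedicated = dedicated_placement(MODEL_NAMES, NUM_GPUS)

# Option 1: a request for any model can go to any of the 10 GPUs.
print(gpus_for("model_3", replicated))
# Option 2: a request for a given model can only go to its one GPU.
print(gpus_for("model_3", dedicated))
```

In the first layout a request can be routed to whichever GPU is least busy, but each GPU has to hold all 10 sets of weights. In the second, each GPU holds one set of weights, but all traffic for a model lands on a single GPU.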
Any suggestions, insights, experience?