How can I improve model latency for web deployment without retraining the models? What checklist should I work through to improve model speed?
I have multiple models that process a video sequentially on one machine with a single K80 GPU; each model takes around 5 minutes to process a 1-minute video. What ideas and suggestions should I try to improve each model's latency without changing the model architectures? How should I structure my thinking about this problem?
So this is an involved question, but as a baseline I approach these problems like this:
- Set up a benchmark script so you can measure whether a change actually helps
- Profile using the PyTorch profiler to find bottlenecks or bugs
- Isolate the source of the problem (is it preprocessing, training, the model's forward pass, or something else?)
- If it's the model itself, look into smaller models, quantization, larger batch sizes, distillation, or pruning; benchmark and see what works best (this is why a benchmark script is important)
- If you're serving this for inference, use an optimized runtime like IPEX, TensorRT, TorchScript, or ONNX Runtime (ORT)
- If none of the above works (it usually will), then it's time to write your own kernels
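The benchmark step can be as simple as a timing harness around each model's per-video inference call. A minimal sketch in plain Python (`run_model` here is a hypothetical stand-in for your actual inference function):

```python
import time
import statistics

def benchmark(fn, *args, warmup=3, runs=10):
    """Time fn(*args) over several runs; return the median in seconds.

    Warmup runs are discarded so one-time costs (CUDA context init,
    lazy allocations, caching) don't skew the numbers.
    """
    for _ in range(warmup):
        fn(*args)
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# Hypothetical stand-in for one model's inference on one video.
def run_model(n):
    return sum(i * i for i in range(n))

median_s = benchmark(run_model, 100_000)
print(f"median latency: {median_s:.4f}s")
```

One caveat when timing GPU code: CUDA kernel launches are asynchronous, so call `torch.cuda.synchronize()` before reading the clock or the numbers will look artificially fast.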
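For the profiling step, `torch.profiler` breaks latency down by operator, which tells you where the time actually goes. A sketch on a small stand-in model (swap in your real model and input; add `ProfilerActivity.CUDA` when profiling on the GPU):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(128, 64).eval()  # stand-in for a real model
x = torch.randn(8, 128)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        model(x)

# Operators sorted by total CPU time; the top rows are your bottlenecks.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```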
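Among the runtimes, TorchScript is the lowest-friction starting point since it ships with PyTorch; TensorRT, IPEX, and ONNX Runtime each need a separate export or install. A tracing sketch on a stand-in model:

```python
import torch

model = torch.nn.Sequential(  # stand-in for a real model
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.ReLU(),
).eval()
example = torch.randn(1, 3, 32, 32)

# Tracing records the ops executed for this example input. Control flow
# that depends on input values is NOT captured, so always check that
# the traced module matches the original on real inputs.
traced = torch.jit.trace(model, example)

with torch.no_grad():
    assert torch.allclose(model(example), traced(example))

# traced.save("model.pt")  # reload later with torch.jit.load, no Python class needed
```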