How can I improve model latency for web deployment without retraining the models? What checklist should I work through to improve model speed?
I have multiple models that process a video sequentially on one machine with a single K80 GPU; each model takes around 5 minutes to process a 1-minute video. What ideas and suggestions should I try to improve each model's latency without changing the model architectures? How should I structure my thinking about this problem?
So this is an involved question, but as a baseline I approach these problems like this:
- Set up a benchmark script so you can measure whether a change actually helps
- Profile using the PyTorch profiler to find bottlenecks or bugs
- Isolate the source of the problem (is it preprocessing, training, the model's forward pass, or something else?)
- If it's the model itself, look into smaller models, quantization, larger batch sizes, distillation, or pruning; benchmark and see what works best (this is why a benchmark script is important)
- If you're serving this for inference, use an optimized runtime like IPEX, TensorRT, TorchScript, or ONNX Runtime (ORT)
- If none of the above works (it usually will), then it's time to write your own kernels
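The benchmark step can be as simple as a timing harness around each model's per-video inference call. A minimal sketch in plain Python (`run_model` here is a hypothetical stand-in for your actual inference function):

```python
import time
import statistics

def benchmark(fn, *args, warmup=3, runs=10):
    """Time fn(*args) over several runs; return the median in seconds.

    Warmup runs are discarded so one-time costs (CUDA context init,
    lazy allocations, caching) don't skew the numbers.
    """
    for _ in range(warmup):
        fn(*args)
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# Hypothetical stand-in for one model's inference on one video.
def run_model(n):
    return sum(i * i for i in range(n))

median_s = benchmark(run_model, 100_000)
print(f"median latency: {median_s:.4f}s")
```

One caveat when timing GPU code: CUDA kernel launches are asynchronous, so call `torch.cuda.synchronize()` before reading the clock or the numbers will look artificially fast.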
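For the profiling step, `torch.profiler` breaks latency down by operator, which tells you where the time actually goes. A sketch on a small stand-in model (swap in your real model and input; add `ProfilerActivity.CUDA` when profiling on the GPU):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(128, 64).eval()  # stand-in for a real model
x = torch.randn(8, 128)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        model(x)

# Operators sorted by total CPU time; the top rows are your bottlenecks.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```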
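Among the runtimes, TorchScript is the lowest-friction starting point since it ships with PyTorch; TensorRT, IPEX, and ONNX Runtime each need a separate export or install. A tracing sketch on a stand-in model:

```python
import torch

model = torch.nn.Sequential(  # stand-in for a real model
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.ReLU(),
).eval()
example = torch.randn(1, 3, 32, 32)

# Tracing records the ops executed for this example input. Control flow
# that depends on input values is NOT captured, so always check that
# the traced module matches the original on real inputs.
traced = torch.jit.trace(model, example)

with torch.no_grad():
    assert torch.allclose(model(example), traced(example))

# traced.save("model.pt")  # reload later with torch.jit.load, no Python class needed
```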