How to improve model latency for deployment

How can I improve model latency for web deployment without retraining the models? What checklist should I work through to improve model speed?

I have multiple models that process a video sequentially on one machine with one K80 GPU; each model takes around 5 minutes to process a video that is 1 minute long. What ideas and suggestions should I try to improve each model's latency without changing the model architecture? How should I structure my thinking about this problem?

So this is an involved question, but as a baseline I approach these problems like this:

  1. Set up a benchmark so you can figure out what to improve
  2. Profile using the PyTorch profiler to find bottlenecks or bugs
  3. Isolate the source of the problem (is it preprocessing, the model itself, or something else?)
  4. If it's the model, then look into smaller models, quantization, larger batch sizes, distillation, or pruning; benchmark each change and see what works best (this is why a benchmark script is important)
  5. If you're using this for inference, use a runtime like IPEX, TensorRT, TorchScript, or ONNX Runtime (ORT)
  6. If none of the above works, which it should most of the time, then it's time to write your own kernels
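For step 1, the benchmark doesn't need to be fancy. Here's a minimal sketch of a timing harness (the `benchmark` helper and its parameters are my own invention, not from any library): warm up first so lazy initialization doesn't pollute the numbers, then report stats over repeated runs. Note that for GPU work you'd also need to call `torch.cuda.synchronize()` before reading the clock, since CUDA launches are asynchronous.

```python
import time
import statistics

def benchmark(fn, *args, warmup=3, runs=10):
    """Time fn(*args): discard warmup runs, then report stats over timed runs."""
    for _ in range(warmup):
        fn(*args)  # warm caches / trigger lazy init before timing
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        # for CUDA models, call torch.cuda.synchronize() here before stopping the clock
        times.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(times),
        "mean_s": statistics.fmean(times),
        "min_s": min(times),
    }
```

Run this on each model in your pipeline separately so you know where the 5 minutes actually go, and re-run it after every change.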
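For step 2, a rough sketch of using the PyTorch profiler on a toy model (the tiny `Sequential` here is just a stand-in for one of your models):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
x = torch.randn(8, 64)

# record CPU ops; add ProfilerActivity.CUDA when profiling on the GPU
with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        model(x)

# table of ops sorted by total time, to spot the hot spots
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

If one op dominates, or if you see unexpected host-to-device copies, that usually tells you which of the options in steps 4 and 5 to reach for first.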
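For the quantization option in step 4, dynamic quantization is the lowest-effort starting point since it needs no retraining or calibration data, only a converted copy of the model. A sketch (the model here is a placeholder; only `torch.nn.Linear` layers get quantized in this mode, and it runs on CPU):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

# replace Linear layers with int8 dynamically-quantized equivalents
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```

Always benchmark the quantized copy against the original for both latency and accuracy; the tradeoff is model-dependent.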
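For step 5, TorchScript is the easiest runtime to try first because it stays inside PyTorch. A sketch of tracing a model for deployment (the conv model and shapes are placeholders for illustration):

```python
import torch

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
example = torch.randn(1, 3, 32, 32)  # an input with the shape you serve in production

# record the ops executed on the example input into a standalone graph
scripted = torch.jit.trace(model, example)
# scripted.save("model.pt")  # load later with torch.jit.load, no Python model code needed
out = scripted(example)
```

Note that tracing bakes in the control flow taken on the example input, so models with data-dependent branching need `torch.jit.script` instead. TensorRT or ONNX Runtime typically buy more speed than TorchScript but take more conversion effort, so measure TorchScript first.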