I am trying to deploy a DL model where inference time is the biggest bottleneck compared to pre- and post-processing. My users might hit my REST API with small batches, so if I serve each request as its own batch, GPU performance suffers. In an ideal world, I would batch together requests from different users that arrive within a short time window, run inference on the GPU once for the combined batch, and then split the results and send each user their own response.
TorchServe seems to have what it calls “optional request batching”, but I would like to know whether that is the same thing. If yes, on what basis is the batching done?
If not, I guess the only option for me would be to do some dirty async work and roll my own solution.
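For reference, the roll-my-own version I have in mind would look roughly like the sketch below. It is only a minimal illustration using `asyncio`, not anything from TorchServe: `MicroBatcher`, `predict_batch`, `max_batch_size` and `max_wait_s` are names I made up, and `predict_batch` stands in for whatever function runs the model on a list of inputs and returns one output per input.

```python
import asyncio


class MicroBatcher:
    """Groups single requests that arrive close together into one model call."""

    def __init__(self, predict_batch, max_batch_size=8, max_wait_s=0.01):
        # predict_batch is a hypothetical callable: list of inputs -> list of outputs
        self.predict_batch = predict_batch
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()

    async def submit(self, item):
        """Called once per incoming request; resolves with that request's output."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        """Background worker: collect a batch, run the model, fan results out."""
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request is waiting.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            # Keep collecting until the batch is full or the time window closes.
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            inputs = [item for item, _ in batch]
            try:
                # Run the blocking model call in a thread so the event loop stays free.
                outputs = await loop.run_in_executor(None, self.predict_batch, inputs)
            except Exception as exc:
                # A failed batch fails every request that was part of it.
                for _, fut in batch:
                    if not fut.done():
                        fut.set_exception(exc)
                continue
            # Hand each caller the output at the same position as its input.
            for (_, fut), out in zip(batch, outputs):
                if not fut.done():
                    fut.set_result(out)
```

Each request handler would then just do `result = await batcher.submit(x)`, with `batcher.run()` started once as a background task inside the running event loop. This ignores multiple workers, backpressure and inputs of different shapes, which is the part I would rather not maintain myself.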
Thanks!