Does torchserve support batching of requests before inference?

I am trying to deploy a deep learning model where inference time is the biggest bottleneck compared to pre- and post-processing. My users might hit my REST API with small batches, so if I serve each request as its own batch, performance suffers. Ideally, I would batch the requests from different users that arrive within a certain small time frame, run the inference on the GPU once, and then separate the results and send each user their response.

TorchServe seems to have something called “optional request batching”, but I would like to know whether it is the same thing. If so, on what basis is this batching done?

If not, I guess the only solution for me would be to do some dirty async work and roll my own solution.


Suppose several clients are making requests to TorchServe in a batched setting. TorchServe has two important configuration options: the batch size and the batch delay.

As requests come in, they keep going into the same batch until either the batch size is reached or the batch delay elapses, and then a single batched inference is run on whatever has been collected.
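To make that rule concrete, here is a rough Python simulation of it. This is only a simplified model for illustration, not TorchServe's actual implementation: the function name `group_into_batches` and its parameters are made up, and it assumes the delay timer starts when the first request of a batch arrives.

```python
from typing import List


def group_into_batches(
    arrivals_ms: List[float],
    batch_size: int,
    max_batch_delay_ms: float,
) -> List[List[float]]:
    """Group request arrival times (in ms) into batches.

    Simplified model (an assumption, not TorchServe's code): a batch is
    closed as soon as it holds `batch_size` requests, or when the next
    request arrives more than `max_batch_delay_ms` after the first
    request in the batch.
    """
    batches: List[List[float]] = []
    current: List[float] = []
    for t in sorted(arrivals_ms):
        batch_full = len(current) >= batch_size
        timed_out = bool(current) and (t - current[0] > max_batch_delay_ms)
        if batch_full or timed_out:
            # Close the current batch; this is where one batched
            # inference call would run.
            batches.append(current)
            current = []
        current.append(t)
    if current:
        # Flush whatever is left once the delay runs out.
        batches.append(current)
    return batches


# Four requests arrive close together, then two more much later.
example = group_into_batches([0, 10, 20, 30, 200, 210], batch_size=4, max_batch_delay_ms=50)
# example == [[0, 10, 20, 30], [200, 210]]
```

The first batch closes because it is full; the second one is partially filled and would be sent once the delay expires. So a larger delay gives fuller batches at the cost of extra latency for the earliest request in each batch.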

You shouldn’t need to build anything from scratch.
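As a sketch of how the two settings are typically passed, the snippet below builds a model registration request for TorchServe's management API using the `batch_size` and `max_batch_delay` query parameters (the delay is in milliseconds). The host, the default management port 8081, the archive name `my_model.mar`, and the chosen values are all assumptions for illustration; adjust them to your deployment. Your handler also has to accept a list of requests and return one result per request, in the same order.

```python
from urllib.parse import urlencode

# Assumed values for illustration only.
MANAGEMENT_URL = "http://localhost:8081/models"  # assumed host and default management port

params = {
    "url": "my_model.mar",      # hypothetical model archive name
    "batch_size": 8,            # maximum number of requests per batch
    "max_batch_delay": 50,      # maximum time to wait for a full batch, in ms
    "initial_workers": 1,       # start one worker so the model can serve
}

register_url = f"{MANAGEMENT_URL}?{urlencode(params)}"

# Registering the model is a POST to this URL, e.g. with the standard library:
#   import urllib.request
#   urllib.request.urlopen(urllib.request.Request(register_url, method="POST"))
# The call is left commented out because it needs a running TorchServe instance.
```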