How to Implement Asynchronous Request Handling in TorchServe for High-Latency Inference Jobs?

I’m currently developing a Rails application that interacts with a TorchServe instance for machine learning inference. The TorchServe server is hosted on-premises and equipped with 4 GPUs. We’re working with Stable Diffusion models, and each inference request is expected to take around 30 seconds due to the complexity of the models.

Given the high latency per job, I’m exploring the best way to implement asynchronous request handling in TorchServe. The primary goal is to manage a large volume of incoming prediction requests efficiently without each client blocking while it waits for a response.

Here’s the current setup and challenges:

  • Rails Application: This acts as the client sending prediction requests to TorchServe (a simplified version of the current blocking call is sketched just after this list).
  • TorchServe Server: Running on an on-prem server with 4 GPUs.
  • Model Complexity: Because of the Stable Diffusion processing, each request takes about 30 seconds.
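
For context, here is roughly what the call from Rails looks like today (simplified; the host, port, and model name are placeholders for our internal setup). It blocks the calling thread for the full ~30 seconds:

```ruby
# Roughly what the Rails app does today. Host, port, and model name are
# placeholders for our internal setup; the call blocks for the full ~30 s.
require "net/http"
require "json"

uri = URI("http://torchserve.internal:8080/predictions/stable_diffusion")

http = Net::HTTP.new(uri.host, uri.port)
http.read_timeout = 60 # generous, because inference alone takes ~30 s

response = http.post(uri.path, { prompt: "a watercolor of a lighthouse" }.to_json,
                     "Content-Type" => "application/json")
image_bytes = response.body # raw output returned by the handler
```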

I’m looking for insights or guidance on the following:

  1. Native Asynchronous Support: Does TorchServe natively support asynchronous request handling? If so, how can it be configured?
  2. Queue Management: If TorchServe does not support this natively, what are the best practices for implementing a queue system on the server side to handle requests asynchronously?
  3. Client-Side Implementation: Tips for managing asynchronous communication in the Rails application. Should I implement a polling mechanism, or are there better approaches? (A rough enqueue-and-poll sketch is included after this list.)
  4. Resource Management: How to make effective use of the 4 GPUs in an asynchronous setup so that requests are processed in parallel and client wait times stay low. (A worker-scaling sketch is also included below.)
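
For questions 1 and 4, the direction I’ve been looking at is running one worker per GPU so that four requests can be processed in parallel, with TorchServe’s internal job queue absorbing bursts. I’ve come across config.properties settings such as number_of_gpu, default_workers_per_model, and job_queue_size, but I’m not sure they are the right levers here. Below is an untested sketch of scaling workers through the management API on port 8081 (host and model name are placeholders); please correct me if this is the wrong approach:

```ruby
# Sketch only: scale the Stable Diffusion model to one worker per GPU via the
# TorchServe management API (port 8081). Host and model name are placeholders.
require "net/http"

management = URI("http://torchserve.internal:8081")
http = Net::HTTP.new(management.host, management.port)

# PUT /models/{model_name}?min_worker=4&max_worker=4&synchronous=true
response = http.send_request(
  "PUT",
  "/models/stable_diffusion?min_worker=4&max_worker=4&synchronous=true"
)
puts response.code # expecting 200 once all four workers are up
```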
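
For questions 2 and 3, the shape I have in mind on the Rails side is: the controller enqueues a background job (ActiveJob, backed by Sidekiq), returns an id immediately, the job makes the slow call to TorchServe and stores the result, and the client polls a status endpoint. All of the class, column, and route names below are hypothetical; the sketch is only meant to show the flow I’m considering:

```ruby
# Hypothetical sketch of the enqueue-and-poll flow; InferenceJob, InferenceResult,
# and the routes are made-up names, not an existing API.
require "net/http"
require "json"

class InferenceJob < ApplicationJob
  queue_as :inference

  def perform(result_id, prompt)
    result = InferenceResult.find(result_id)
    uri = URI("http://torchserve.internal:8080/predictions/stable_diffusion")

    http = Net::HTTP.new(uri.host, uri.port)
    http.read_timeout = 120 # inference takes ~30 s, leave headroom

    response = http.post(uri.path, { prompt: prompt }.to_json,
                         "Content-Type" => "application/json")
    result.update!(status: "done", payload: response.body)
  rescue StandardError => e
    result&.update!(status: "failed", error_message: e.message)
  end
end

class InferenceRequestsController < ApplicationController
  # POST /inference_requests: enqueue the job and return an id right away (202).
  def create
    result = InferenceResult.create!(status: "queued")
    InferenceJob.perform_later(result.id, params.require(:prompt))
    render json: { id: result.id, status: result.status }, status: :accepted
  end

  # GET /inference_requests/:id: the client polls this until "done" or "failed".
  def show
    result = InferenceResult.find(params[:id])
    render json: { id: result.id, status: result.status }
  end
end
```

Is polling like this reasonable at our volume, or would pushing the result back to the client (ActionCable, webhooks) be a better fit?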

Any advice, experiences, or pointers to relevant documentation would be greatly appreciated. I’m aiming to make this process as efficient and scalable as possible, considering the high latency of each inference job.

Thank you in advance for your help!